I did non-trivial work with Apache Spark dataframes and came to appreciate them before ever being exposed to Pandas. After Spark, Pandas just seemed frustrating and incomprehensible. Polars is much more like Spark, and I am very happy about that.
DuckDB even goes so far as to include a clone of the PySpark dataframe API, so somebody there must like it too.
I had a similar experience with Spark; especially in the Scala API, it felt very expressive and concise once I got used to certain idioms. Also +1 on DuckDB, which is excellent.
There are some frustrations in Spark, however. I remember getting stuck on Winsorizing over groups. Hilariously, there are identical functions called `percentile_approx` and `approx_percentile`, and it wasn't clear from the docs whether they were the same or at least did the same thing.
Given all that, the ergonomics of Julia for general-purpose data handling are really unmatched IMO. I've got a lot of clean and readable data pipeline and shaping code that I revisited a couple of years later and could easily understand. And making updates with new, more type-generic functions is a breeze. Very enjoyable.
Yeah, I couldn't get it done in the Spark API; I had to combine Spark and Spark SQL because the window function I needed was (probably) not available in Spark. It was inelegant, I thought.
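For anyone stuck on the same thing: one workaround that stays in the DataFrame API is to aggregate the approximate percentiles per group and join them back, rather than fighting the window functions. A rough sketch (column names and cutoffs are made up for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: one row per (id, sales).
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("a", 100.0), ("b", 5.0), ("b", -50.0)],
    ["id", "sales"],
)

# Approximate 5th/95th percentiles per group...
bounds = df.groupBy("id").agg(
    F.percentile_approx("sales", 0.05).alias("lo"),
    F.percentile_approx("sales", 0.95).alias("hi"),
)

# ...joined back and used to clip, i.e. winsorize, each group.
winsorized = df.join(bounds, "id").withColumn(
    "sales_w", F.least(F.greatest("sales", F.col("lo")), F.col("hi"))
)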
I have not worked with Spark, but I have used Athena/Trino and BigQuery extensively.
For me, I don't really understand the hype around Polars, other than that it fixes some annoying issues with the Pandas API by sacrificing backwards compatibility.
With a single-node engine there is a ceiling on how good it can get.
With Spark/Athena/BigQuery the sky is the limit. It is such a freedom not to be limited by available RAM or CPU. They just scale to what they need. Some queries squeeze CPU-days of work into just a few minutes.
I don't know how well the Polars implementation works, but what I love about PySpark is that sometimes Spark is able to push those groupings down to the database. Not always, but sometimes. However, I imagine that many people love Polars/Pandas performance for transactional queries: from start to finish, get me a result in less than a second, as long as the number of underlying rows is not greater than 20k-ish. PySpark will never be super great for that.
Pandas sat alone in the Python ecosphere for a long time. Lack of competition is generally not a good thing. I'm thrilled to have Polars around to innovate on the API end (and push Pandas to be better).
And I say this as someone who makes much of their living from Pandas.
I think the Pandas team is well aware of the unfortunate legacy API decisions, even without Polars pointing them out. They are trapped by backwards compatibility. Wes’ “Things I Hate About Pandas” post covers the highlights, most of which boil down to not having put a layer between NumPy and Pandas, which is why they were stuck with the unfortunate integer-null situation.
That's all stuff they could fix, if they were willing to, with a major version bump. They'd need a killer feature to encourage that migration, though.
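For anyone who hasn't hit it, the integer-null situation mentioned above looks like this (standard pandas behavior, nothing specific to this thread):

import pandas as pd

# NumPy-backed integer columns have no missing-value representation,
# so a single None silently forces the whole column to float64.
print(pd.Series([1, 2, 3]).dtype)      # int64
print(pd.Series([1, 2, None]).dtype)   # float64

# The later nullable extension dtype keeps integers intact,
# but it is opt-in rather than the default.
print(pd.Series([1, 2, None], dtype="Int64").dtype)  # Int64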
The really brutal thing is all of the code using Pandas written by researchers and non-software engineers running quietly in lab environments. Difficult to reproduce environments, small or non-existent test suites, code written by grad students long gone. If the Pandas interface breaks for installs done via `pip install pandas` it will cause a lot of pain.
With that acknowledged, it would make life a lot easier on everyone if a "fix the API" Pandas 3 had a different package name. Polars and others seem like exactly that solution, even if not literally Pandas.
I've wanted to convert a massive Pandas codebase to Polars for a long time. Probably 90% of the compute time is Pandas operations, especially creating new columns / resizing dataframes (which I understand involves less of a speed difference than the grouping operations mentioned in the post, but is still substantial). Has anyone had success doing this and found it to be worth the effort?
It's the opposite; I prefer DuckDB and generally work with DuckDB's friendly SQL interface. SQL is declarative and is (for me) more intuitive than method-chaining -- especially for complex analytic operations that happen in one go.
(Software people might beg to differ about the intuitive bit because they are more used to an imperative style. To my surprise, even the best software engineers struggle with SQL, because it requires one to think in set and relation operations rather than function calls, which many software folks are not used to.)
I actually don't use the Polars dataframe APIs much except for some operations which are easier to do in dataframe form, like applying a Python function as UDF, or transposing (not pivoting) a dataframe.
Also Polars is good for materializing the query into a dataframe rapidly, which can then be passed into methods/functions. It's also a lot easier to unit test dataframes than SQL tables. There's a lot more tooling for that.
The difference is a sanely and presciently designed expression API, which is a bit more verbose in some common cases, but is more predictable and much more expressive in more complex situations like this.
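For reference, the operation discussed here ("find the maximum value of 'views', where 'sales' is greater than its mean, per 'id'", per the SQL reply below) is a single expression in Polars; a minimal sketch with made-up data:

import polars as pl

df = pl.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "sales": [10.0, 20.0, 30.0, 5.0, 50.0],
    "views": [100, 200, 300, 40, 80],
})

# Within each 'id' group: keep rows whose 'sales' exceeds the group mean,
# then take the max of 'views' over what remains.
out = df.group_by("id").agg(
    pl.col("views").filter(pl.col("sales") > pl.col("sales").mean()).max()
)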
On a tangent, I wonder what this op would look like in SQL? It would probably need support for filtering in a window function, which I'm not sure is standardized?
But also props to Wes McKinney for giving us a dataframe library during a time when we had none. Java still doesn’t have a decent dataframe library so we mustn’t take these things for granted.
The Pandas API is no longer the way things should be done today, nor should it be what new tutorials teach. Pandas was the jQuery of its time -- great, but no longer the state of the art. But I have much gratitude for it being around when it was needed.
-- "find the maximum value of 'views',
-- where 'sales' is greater than its mean, per 'id'".
select max(views), id -- "find the maximum value of 'views',
from example_table as et
where exists
(
SELECT *
FROM
(
SELECT id, avg(sales) as mean_sales
FROM example_table
GROUP by id
) as f --
where et.sales > f.mean_sales -- where 'sales' is greater than its mean
and et.id = f.id
)
group by id; -- per 'id'".
The power of having an API that allows usage of the Free monad.
And in less-funny-FP-speak, the power of allowing the user to write a program (expressions) that the sufficiently-smart backend later compiles/interprets.
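Concretely, that's what the Polars lazy API does: the chained expressions just build a query plan, and nothing runs until collect(), by which point the optimizer has seen the whole program. A small sketch (the file name and columns are made up):

import polars as pl

# Build a description of the computation; nothing executes yet.
lazy = (
    pl.scan_csv("example.csv")          # hypothetical file
      .filter(pl.col("sales") > 0)
      .group_by("id")
      .agg(pl.col("views").max())
)

print(lazy.explain())    # inspect the optimized plan the backend produced
result = lazy.collect()  # only now does anything run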
Awesome! Didn't expect such a vast difference in usability at first.
If I'm doing some data science just for fun and personal projects, is there any reason to not go with Polars?
I took some data science classes in grad school, but basically haven't had any reason to touch pandas since I graduated. But I did like the ecosystem of tools, learning materials, and other libraries surrounding it when I was working with it. I recently started a new project and am quickly going through my old notes to refamiliarize myself with pandas, but maybe I should just go and learn Polars?
Pandas can be easier for interactive poking around on smaller data sets. It feels more like Excel, whereas Polars feels more like SQL. It's no surprise that on a forum full of programmers Polars is widely preferred, but IMO it's worth at least trying both.
I adore polars for its speed and I find the interface easier than pandas. But pandas still has a richer ecosystem of stuff built around it. I try to use polars in greenfield things but occasionally get yoinked back.
Does anyone have a good heuristic for when a dataframe library is a good tool choice? I work on a team that has a lot of data scientists and a few engineers (including myself), and I often see the data scientists using dataframes when simple Python classes would be much more appropriate, so that you have a better sense of the object you're working with. I've been having a hard time getting this idea across to people, though.
Performance is my heuristic. I can't make it quantitative, because 100M records in 1 minute might be considered fast for some use cases, but slow for others. For me it's the qualitative "is this thing too slow?".
Personally, I see a dataframe library as a last resort. I prefer to improve the algorithm, or push more things into the database, or even scale up the hardware in some cases. If I've exhausted all other options and it's still too slow, then I use a dataframe library.
But then I'm not a data scientist. I've found that data scientists have a hammer that they really really like, so they'll use it everywhere.
Frankly, the heuristic I've developed over the past few years working on a team that sounds like yours is: The data scientists are probably right.
If you're actually operating on an object, i.e. the equivalent of a single row in a dataframe, then yeah, it's silly to use a dataframe library. But if you're operating on N value objects ... yeah, you probably want a dataframe with N rows and a column for each field in your object.
Your mileage may vary, I guess, but I resisted this for quite a while and I now think I was the one who was wrong.
DataFrames are easy to use, everyone knows how to use them, you can move fast, and it's easy to iterate and test differences between things, and reviewing the code is a breeze.
That said, my team moved to polars about a year ago and we haven't looked back.
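For what it's worth, the "N value objects become N rows" conversion above is close to a one-liner; a minimal sketch with a made-up Order dataclass:

from dataclasses import dataclass, asdict
import polars as pl

@dataclass
class Order:            # hypothetical value object
    order_id: int
    amount: float

orders = [Order(1, 9.99), Order(2, 25.0), Order(3, 4.5)]

# One row per object, one column per field.
df = pl.DataFrame([asdict(o) for o in orders])
print(df.select(pl.col("amount").sum()))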
I have the opposite opinion. In a previous codebase I fought hard to use dataclasses & type hinting where possible over dictionaries, because with dictionaries you'd never know what type anything was, or what keys were present. That worked nicely and it was much easier to understand the codebase.
Now I've been put on a Pandas project and it's full of mysterious
df = df[df["thing"] == "value"]
I just feel like we've gone back to the unreadability of dictionaries.
Everything's just called "df", you never know what type anything is without going in and checking, the structure of the frames is completely opaque, they change the structure of the dataframe halfway through the program. Type hinting these things is much harder than TypedDict/dataclass, at least doing it correctly & unambiguously is. It's practically a requirement to shove this stuff in a debugger/REPL because you'd have no chance otherwise.
Sure, the argument is that I'm just in a bad Pandas codebase, and it can be done much better. However what I take issue with is that this seems to be the overwhelming "culture" of Pandas. All Pandas code I've ever read is like this. If you look at tutorials, examples online, you see the same stuff. They all call everything the same name and program in the most dynamic & opaque fashion possible. Sure it's quick to write, and if you love Pandas you're used to it, but personally I wince every time I look in a method and see this stuff instead of normal code.
Personally I only use Pandas if I absolutely need it for performance, as a last resort.
Why would you generate SQL using another programming language? To me that sounds like something you'd only do if you're deep in an ORM with no escape hatch. For data analysis tasks, that's extremely unergonomic and you should definitely just write normal SQL. Use the ORM for CRUD. I've never seen an ORM that won't let you drop down to regular SQL for ad-hoc queries.
Editor completion is an extremely low ranking aspect for choosing technologies for data analysis. If SQL is the better tool but you're not using it because it doesn't have editor completion, then you need a better editor. It pains me when people prioritise "developer experience in VS Code" over "actually the correct technological choice".
I’ve moved mostly to polars. I still have some frameworks that demand pandas, and pandas is still a very solid dataframe library, but when I need to interpolate monthly values across millions of rows of quarterly data, polars just blows it away.
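If it helps anyone, that kind of quarterly-to-monthly interpolation is roughly this in Polars (column names and dates are made up; details like the interpolation method will vary):

import polars as pl
from datetime import date

# Hypothetical quarterly series.
df = pl.DataFrame({
    "date": [date(2024, 1, 1), date(2024, 4, 1), date(2024, 7, 1)],
    "value": [10.0, 16.0, 13.0],
}).sort("date")

# Insert the missing monthly rows, then linearly interpolate the gaps.
monthly = (
    df.upsample(time_column="date", every="1mo")
      .with_columns(pl.col("value").interpolate())
)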
Even better is using tools like Narwhals and Ibis which can convert back and forth to any frames you want.
My book, Effective Polars, has a whole chapter devoted to the question of whether porting from pandas makes sense.
However, there are subtle differences between Pandas and Polars behaviors, so regression testing is your friend. It's not a 1:1 mapping.
I go back and forth between DuckDB and Polars functions in the same scope because it's so cheap to convert between the two.
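In case it's useful to others, the round trip is roughly this (made-up data; DuckDB picks up in-scope frames by name via its replacement scans):

import duckdb
import polars as pl

df = pl.DataFrame({"id": [1, 1, 2], "sales": [10.0, 20.0, 5.0]})

# DuckDB can scan the in-memory Polars frame directly by its variable name...
rel = duckdb.sql("select id, avg(sales) as mean_sales from df group by id")

# ...and hand the result straight back as a Polars DataFrame.
means = rel.pl()
print(means)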
I disagree that Pandas is no longer state of the art: its interface is optimized for a different use case compared to Polars.
No need to filter within the window function if you use a subquery or CTE, which is supported everywhere.
According to Wikipedia, window functions were standardized back in 2003.
(I spend a good deal of my time helping clients use pandas and Polars.)
Oh yeah, toss numba or cython on top and you are back to numpy speed...
(Admittedly, there are more features, but in my book, I demonstrate pure Python code that runs as fast as numpy and cython with this decorator added.)
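For readers who haven't tried that approach, the pattern is presumably something like numba's @njit: keep the plain-Python loop and let the decorator compile it. A rough sketch (the function and data here are made up, not taken from the book):

import numpy as np
from numba import njit

@njit  # compile the plain-Python loop to machine code on first call
def clipped_sum(values, lo, hi):
    total = 0.0
    for v in values:
        if v < lo:
            v = lo
        elif v > hi:
            v = hi
        total += v
    return total

x = np.random.rand(1_000_000)
print(clipped_sum(x, 0.05, 0.95))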