Typically, Pandas finds its sweet spot with small- to medium-sized datasets of up to a few million rows. Beyond this, distributed frameworks such as Spark or Dask are usually preferred. It is, however, possible to scale Pandas well beyond this point.
The typical issue when scaling with Pandas is memory utilization. Pandas keeps its data in memory rather than on disk, but it does offer generator functionality that allows some of its datasets to be iterated over chunk by chunk.
Pandas offers an extensive array of integration and export options, which is a compelling reason to use the library. These integrations can, however, be unoptimized for larger datasets and might need some tweaking.
One of the easiest memory improvements you can make is to provide appropriate types to Pandas. Pandas stores columns either in NumPy arrays or in its own data structures, with column types known as 'dtypes.'
Category type: The category type is a dtype built into Pandas. Category types provide a way to store repeated values more efficiently. Pandas stores the distinct values in a dictionary and keeps a numerical lookup key in each record. For example, 'product_category_type: book' could be stored within the record as 1, with the column metadata providing the lookup {1: 'book'}. This can significantly reduce memory utilization when string values are long and highly repeated.
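A minimal sketch of this conversion (the column name and values are illustrative):

```python
import pandas as pd

# Illustrative data: the same string repeated across many rows
df = pd.DataFrame({"product_category_type": ["book"] * 1_000_000})

before = df["product_category_type"].memory_usage(deep=True)
df["product_category_type"] = df["product_category_type"].astype("category")
after = df["product_category_type"].memory_usage(deep=True)

print(f"object: {before:,} bytes, category: {after:,} bytes")
```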
String & string Arrow type: By default, Pandas stores strings as Python object types within a column. Pandas does, however, offer a dedicated 'string' dtype for managing text data. While it provides a significant number of improvements compared to just leveraging an object type, it does not provide any memory usage improvement. The string[pyarrow] type, however, brings significant memory improvements over the standard string type. PythonSpeed reported more than 3x better memory utilization using this dtype over Python strings, based on random 18-character strings.
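A quick sketch of switching a column to the Arrow-backed string dtype (the values are illustrative, and PyArrow must be installed):

```python
import pandas as pd

# One million strings stored first as Python objects, then Arrow-backed
s_object = pd.Series([f"user-{i:012d}" for i in range(1_000_000)])  # default object dtype
s_arrow = s_object.astype("string[pyarrow]")                        # requires pyarrow

print(s_object.memory_usage(deep=True))  # Python string objects
print(s_arrow.memory_usage(deep=True))   # Arrow string buffer, typically much smaller
```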
Numerical types: By default, numerical values are stored using 64 bits, but for many use cases, lower precision is perfectly acceptable. Using 32- or 16-bit values reduces the memory utilization of these columns by 2–4x.
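For example, a sketch of downcasting an integer column to the smallest type that fits its values:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))             # int64 by default
s_small = pd.to_numeric(s, downcast="integer")  # e.g. int32, depending on the values

print(s.dtype, s.memory_usage(deep=True))
print(s_small.dtype, s_small.memory_usage(deep=True))
```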
Chunking provides the opportunity to process only part of the data at a time and thereby reduce memory utilization. When the chunksize argument is used, the read_sql() function returns a generator of DataFrames. While this is an excellent first step, it is worth noting that, by default, the cursor is handled client-side, downloading a relatively large amount of data onto the client before releasing it.
Setting up the SQLAlchemy connection object with the right execution options and passing this connection object to Pandas allows memory utilization to be managed more effectively.
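A sketch of that setup, assuming a PostgreSQL database reachable through SQLAlchemy (the connection string, query, and chunk size are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# stream_results=True asks the driver to use a server-side cursor, so rows are
# fetched from the database as each chunk is consumed rather than all at once
with engine.connect().execution_options(stream_results=True) as conn:
    for chunk in pd.read_sql("SELECT * FROM big_table", conn, chunksize=50_000):
        ...  # process each DataFrame chunk here
```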
When leveraging the to_sql statement, Pandas converts the DataFrame under the hood into an INSERT SQL statement that can be executed against the database. The to_sql statement creates a new object in memory, a copy of the DataFrame as a SQL statement, which increases memory usage. This is accentuated by the fact that an INSERT SQL statement is a less efficient way to store data than a DataFrame.
One way to mitigate this issue is, of course, to leverage the chunksize parameter. In this case, the SQL statement object in memory only spans a given number of rows, and Pandas manages the successive calls to the database to push the data contained in the DataFrame. This operation can be slow, however.
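For example, reusing the engine from the earlier sketch (the table name and chunk size are placeholders):

```python
# Write in batches of 10,000 rows so only a small INSERT statement
# exists in memory at any given time
df.to_sql(
    "target_table",
    con=engine,
    if_exists="append",
    index=False,
    chunksize=10_000,
)
```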
The COPY statement in Postgres allows for a more efficient way to bulk-load data into the database. It comes, however, at the cost of additional constraints and complexity. The COPY statement, for instance, requires that your data is shaped in the same way as the table it needs to be inserted into. There are, of course, ways around this, such as creating a temporary target table in the database used for loading before finally doing a table-to-table insertion. However, this type of logic is not present in Pandas directly and needs to be coded outside of it.
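One common pattern, adapted from the insertion-method example in the pandas documentation, is to pass a custom method callable to to_sql that streams the rows through psycopg2's COPY ... FROM STDIN. A sketch (table name and engine are placeholders):

```python
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    """to_sql insertion method that streams rows through PostgreSQL's COPY."""
    dbapi_conn = conn.connection  # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ", ".join(f'"{k}"' for k in keys)
        table_name = f"{table.schema}.{table.name}" if table.schema else table.name
        cur.copy_expert(f"COPY {table_name} ({columns}) FROM STDIN WITH CSV", buf)

df.to_sql("target_table", engine, if_exists="append", index=False, method=psql_insert_copy)
```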
Chances are, if you are trying to scale with Pandas, it involves dealing with optimized file formats such as Parquet. Pandas provides support for reading and writing Parquet, but its support is currently quite unoptimized and requires a bit of know-how to be used effectively.
Leveraging chunks in a memory-optimized manner is possible with Parquet when the files have been created with an appropriate row group size. Unfortunately, Pandas doesn't offer any chunking option when reading Parquet files at this point.
By default, the read_parquet function of Pandas leverages PyArrow's read_table function, which itself does not provide a chunking option. It does, however, provide support for applying filters when reading rows, which can help manage memory when reading a few elements out of large files.
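A sketch of passing filters through read_parquet, and, as an alternative for chunked reads, going to PyArrow's ParquetFile.iter_batches directly (file path, column name, and batch size are placeholders):

```python
import pandas as pd
import pyarrow.parquet as pq

# Option 1: push a filter down to PyArrow so only matching rows/row groups are loaded
df = pd.read_parquet(
    "events.parquet",
    engine="pyarrow",
    filters=[("event_date", ">=", "2021-01-01")],
)

# Option 2: iterate over the file batch by batch with PyArrow directly
parquet_file = pq.ParquetFile("events.parquet")
for batch in parquet_file.iter_batches(batch_size=100_000):
    chunk = batch.to_pandas()  # each batch becomes a small DataFrame
    ...  # process the chunk here
```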
If you are using files stored in S3, a different library, AWS Wrangler, exists. Wrangler provides a higher level of abstraction and features built on top of Pandas and PyArrow, notably support for reading Parquet files in chunks. However, one of the things it does not currently offer is the ability to leverage predicate pushdown on logical columns, only on partitions.
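A sketch of both features in awswrangler (the S3 path and partition column are placeholders, and the exact parameters may differ between versions):

```python
import awswrangler as wr

# chunked=True yields an iterator of DataFrames instead of one large frame
for chunk in wr.s3.read_parquet(path="s3://my-bucket/my-dataset/", chunked=True):
    ...  # process each chunk here

# Predicate pushdown is supported on partition columns via partition_filter
df = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",
    dataset=True,
    partition_filter=lambda x: x["year"] == "2021",
)
```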
One of the first things to consider when leveraging the to_parquet function with Pandas is that there is no row group size defined by default. It is possible to provide one as a kwarg, as this parameter is supported out of the box:
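For example (the file name and row group size are illustrative):

```python
# row_group_size is forwarded to PyArrow's write_table under the hood
df.to_parquet("output.parquet", engine="pyarrow", row_group_size=100_000)
```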
Setting this parameter to an appropriate value enables reading the parquet file iteratively at the cost of slightly higher storage.
The second thing to consider is that, similarly to the read_parquet function, Pandas does not yet offer chunking/iterator methods for writing Parquet files. It is, of course, possible to write to multiple Parquet files and keep memory usage low, but this usually leads to a 'too many small files' issue when trying to leverage big data tools such as Apache Spark.
One way to solve this is by leveraging the PyArrow library, as described on StackOverflow:
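A sketch of that approach, writing successive DataFrame chunks into a single Parquet file (the output file name and the chunk iterable are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for chunk in dataframe_chunks:  # placeholder: any iterable of DataFrames
    table = pa.Table.from_pandas(chunk)
    if writer is None:
        # Open the output file once, reusing the schema of the first chunk
        writer = pq.ParquetWriter("output.parquet", table.schema)
    writer.write_table(table)  # appends the chunk as (at least) one row group

if writer is not None:
    writer.close()
```

Because only one chunk is materialized at a time, memory stays bounded while everything still ends up in a single file with sensible row groups.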