Speeding up your Python data pipelines with Cython and Nim
By Julien Kervizic

There are many ways to speed up Python, from writing more efficient Python code to leveraging optimized libraries such as NumPy for mathematical computation. It is also possible to speed up Python by writing C extension code, or by using other languages such as Cython or Nim that compile to C/C++ and let you extend Python in a more efficient way.

Cython and Nim allow you to use Python objects from within code written in the new language. It is for instance possible to create a Pandas DataFrame and call its methods in either of these languages. Some of the speed advantages offered by these languages include compile-time optimization, proper multiprocessing support, and optimized data structures through appropriate type definitions.

The performance gains can be quite substantial compared to vanilla Python, while at the same time offering accessibility and flexibility.

Cython

Installing on a Mac

Setting up the necessary tooling to make Cython work properly on Mac can be quite a journey.

C Compiler

In order to build Cython code, it is necessary to have a C compiler installed on the Mac. The instruction for this is available in the following post:

If you want to take advantage of parallel processing, some additional steps are needed. Apple makes it quite difficult to run code with OpenMP, so you need to install a different compiler through Homebrew for this purpose:

brew install llvm

It is also convenient to install libomp with brew install libomp in order to take advantage of parallel operations.

NumPy

We have to make sure that NumPy is discoverable by the C compiler; otherwise, we will run into the error “fatal error: ‘numpy/arrayobject.h’ file not found” when building with setuptools. This can be done with a quick export command that adds NumPy's include directory to the compiler's search path.

That path can be obtained by calling np.get_include() in a Python shell.
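As a minimal sketch of how that path could be turned into a compiler flag, the helpers below return a -I flag built from np.get_include() and check that the header is really there. The helper names are hypothetical, and passing the flag through CFLAGS is an assumption about the build setup; other setups pass the same directory to setuptools through include_dirs instead.

```python
import importlib.util
import os


def numpy_include_flag():
    """Return a '-I<dir>' flag for NumPy's C headers, or None if NumPy is not installed."""
    if importlib.util.find_spec("numpy") is None:
        return None
    import numpy as np  # imported lazily so the helper still works without NumPy

    return "-I" + np.get_include()


def has_arrayobject_header(flag):
    """Check that numpy/arrayobject.h exists under the flagged directory."""
    include_dir = flag[2:]  # strip the leading '-I'
    return os.path.isfile(os.path.join(include_dir, "numpy", "arrayobject.h"))


flag = numpy_include_flag()
if flag is not None:
    # Assumption: the build picks up extra include paths from CFLAGS.
    print(f'export CFLAGS="{flag} $CFLAGS"')
```

The printed line can be pasted into the shell that runs the build.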

Using Cython

Interacting with setuptools & Jupyter Notebook

Jupyter notebook has some very neat extensions set up in order to allow for in-line compilation of Cython code.

The only thing needed is to load the extension using %load_ext Cython and to put the magic function %%cython at the top of the cell that contains the Cython code.

Operating Cython

Cython can easily be set up in code by “cimporting” the cython library and defining a C function. In this example there is one Python function, “sum_np_column_arrays”, and one C function, “c_sum_np_column_arrays”. The Python function is needed to serve as an interface to the rest of our Python code, since a C function defined with cdef cannot be called from Python directly.

Running this Cython code as-is is, well, slow.

It is about 75x slower than the vectorized NumPy operation (110 ms), and only slightly faster than plain Python (8.54 s).
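To reproduce this kind of comparison, a timing harness along the following lines can be used. It is only a sketch: the body of the pure-Python function, the helper names, and the array size are assumptions, not the benchmark behind the figures above, and timings will vary by machine.

```python
import timeit


def sum_columns_python(col_a, col_b):
    """Element-wise sum of two equal-length columns with a plain Python loop."""
    result = [0.0] * len(col_a)
    for i in range(len(col_a)):
        result[i] = col_a[i] + col_b[i]
    return result


def best_time(func, *args, repeat=3):
    """Best wall-clock time in seconds over `repeat` single calls of func(*args)."""
    return min(timeit.repeat(lambda: func(*args), repeat=repeat, number=1))


def compare(n=1_000_000):
    """Time the Python loop against NumPy's vectorized addition on n rows."""
    import numpy as np  # only needed for the vectorized baseline

    a = np.random.rand(n)
    b = np.random.rand(n)
    loop_seconds = best_time(sum_columns_python, a, b)
    numpy_seconds = best_time(np.add, a, b)
    return loop_seconds, numpy_seconds
```

Calling compare() returns both timings in seconds; the ratio between them is more informative than the absolute values.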

Optimize

The magic function %%cython -a gives us a better sense of what might be slowing down our code, by highlighting the lines where the code interacts with the Python interpreter instead of running as plain C.

Type Annotation

Adding type annotations brings us on par with the performance of the vectorized operation in NumPy.

Multithreading / Multiprocessing

Setting up multiprocessing allows us to extract an additional 50bps improvement over the type-annotated example and NumPy’s built-in computation (~110 ms).

This however requires some work in configuring OpenMP or …
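The idea behind a parallel loop is to split the data into chunks, reduce each chunk independently, and combine the partial results. The sketch below shows that structure with Python's standard thread pool rather than OpenMP. It is a swapped-in illustration of the partitioning only: because of the GIL, pure-Python threads will not deliver the speed-up that compiled code running without the GIL gets. The function names and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a sequence into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, n_workers=4):
    """Sum each chunk in its own task, then combine the partial sums."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)
```

The same split, reduce, and combine shape is what a parallel range loop expresses in a single construct.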

When do you really get the benefits of Cython

There are many places where you can get performance benefits by using Cython. The main ones are when you have highly parallelizable code, when there is a large amount of data that can be easily typed, or when you need to compute multiple row operations at a time that are not easily vectorized.
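As an illustration of a row operation that is not easily vectorized, consider a running total that is clipped to a range after every row: each output depends on the previous clipped output, so a plain cumulative sum does not give the same result. The pure-Python sketch below is a hypothetical example of such a loop (the function name and the clipping rule are assumptions); loops of this shape are the ones that tend to benefit from being typed and compiled.

```python
def clipped_running_total(values, lower, upper):
    """Running total whose state is clipped to [lower, upper] after every row."""
    total = 0.0
    out = []
    for value in values:
        # The next state depends on the previous *clipped* state.
        total = min(max(total + value, lower), upper)
        out.append(total)
    return out
```

For the rows 5, 10, -3, 4 with bounds 0 and 12 it yields 5, 12, 9, 12, whereas clipping a plain cumulative sum (5, 15, 12, 16) would give 5, 12, 12, 12.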

Nim

Nim is another programming language, with a syntax that is somewhat inspired by Python but significantly different. It is a statically typed, compiled language. Nim compiles to JavaScript, C and C++. It is through its C/C++ compilation that Nim can be leveraged alongside Python.

Nim has a few components in its ecosystem that facilitate its integration with Python. The first is Nimpy, a Nim library that allows calling Python from Nim and provides facilities for exporting Nim code to Python. Nimpy can easily be installed with nimble, Nim’s package manager. The other is Nimporter, a Python library that facilitates the integration of Nim and relies on Nimpy as a dependency. It allows for run-time compilation of Nim code and its dependencies, as well as pre-compiling the different extensions.

Installing Nim on a Mac

Similar to Cython, Nim requires a C compiler such as Clang (part of LLVM) to be installed on your computer. To be able to leverage Python within Nim, you will also need to install Nimpy and have Python installed on your system.

The Nimpy package can be installed with nimble install nimpy, and Nimporter can in turn be installed with the Python package manager pip: pip install nimporter.

Create C Extensions

The following command would compile a file named basic_import_lib into an importable C extension, basic_import_lib.so.

Once compiled and exported, the extension can then be imported into a Python file.
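On the Python side, a compiled extension is imported like any other module, provided the file sits on the import path. A minimal sketch using only the standard library follows; load_extension is a hypothetical helper, and basic_import_lib is assumed to have been built into a directory Python searches.

```python
import importlib
import importlib.machinery
import importlib.util


def load_extension(module_name):
    """Import a module by name, returning None if Python cannot find it."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


# File suffixes Python accepts for compiled extensions on this platform.
print(importlib.machinery.EXTENSION_SUFFIXES)

lib = load_extension("basic_import_lib")
if lib is None:
    print("basic_import_lib was not found on the import path")
```

Checking with find_spec first gives a clearer message than an ImportError when the build step has not been run yet.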

Leverage Python in Nim

In order to leverage standard Python functions within Nim, it is necessary to use pyBuiltinsModule. Once Nimpy is imported, it can easily be bound to a variable, named py in this article.

Python's built-in functions will then be available under the py object, for example py.len or py.sum.

NumPy

Pandas

As with NumPy, it is possible to import pandas in Nim. A DataFrame can be defined explicitly, or it can be loaded from a Parquet file.

Dealing with Python

Python objects sometimes need to be converted to native Nim types using the to proc (function). For example, to(py.int(py.len(df_chunk.index)), int) returns the length of a DataFrame chunk's index to Nim. The index of df_chunk is passed to Python’s len function, the result is cast to a Python integer, and that value is then converted to a Nim int, which can be summed and processed natively.

Concurrent and Parallel operations

As with Cython, Nim can run parallel code through OpenMP. This is done in a for loop using the || operator.

On top of this, Nim supports parallelism through the threadpool library and its parallel and spawn statements. Additionally, there is parallelism support through the Weave library.

Summary

Both Cython and Nim offer ways to achieve better performance in Python by making code interoperable between the different languages. This, however, increases complexity and should be managed carefully. The choice of language is a matter of personal preference, as ultimately they both compile to C code.

© 2024 WiseAnalytics