As with numba, you can parallelise cython-compiled code, because it is not limited by the requirement to go through the Python Virtual Machine or to hold the GIL.
Parallelising cython code is similar to numba, in that you use a prange to parallelise loops. The restrictions are that you are only allowed to use prange when you are not holding the GIL (with nogil:) and when you are inside a marked parallel section (with parallel()). We normally combine these two together into with nogil, parallel().
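In skeleton form, this looks something like the following (a minimal sketch; the function name, bound and loop body are placeholders):
from cython.parallel import parallel, prange

def do_work(int n):
    cdef int i
    with nogil, parallel():
        # iterations of this loop are shared across the available threads;
        # only GIL-free (pure C) operations are allowed inside
        for i in prange(0, n):
            pass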
For example, here is a serial cython version of the more complex calculate_roots_sum function from before. We will copy this into a file called calculate_roots_sum.pyx;
#cython: language_level=3
cimport cython
from libc.math cimport sqrt

import numpy as np

@cython.boundscheck(False)
def calculate_roots_sum(numbers):
    cdef int num_vals = len(numbers)
    result = np.zeros(num_vals, "f")

    # typed memoryviews give fast, GIL-free access to the arrays
    cdef double[::1] numbers_view = numbers
    cdef float[::1] result_view = result

    cdef int i = 0
    cdef int j = 0
    cdef float total = 0.0

    with nogil:
        for i in range(0, num_vals):
            total = 0.0
            for j in range(0, num_vals):
                total = total + sqrt(numbers_view[j])
            result_view[i] = total

    return result
We can load and time this in a Jupyter notebook using
import pyximport
pyximport.install()
import calculate_roots_sum
Next, we will time it on a set of 10,000 random numbers
import numpy as np
numbers = 500.0 * np.random.rand(10000)
%timeit calculate_roots_sum.calculate_roots_sum(numbers)
281 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
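Before parallelising, it is worth checking the result. Every element of the output should equal the sum of the square roots of all of the inputs, so we can compare against numpy (a quick sketch; the loose tolerance allows for float32 accumulation error):
import numpy as np
import calculate_roots_sum

numbers = 500.0 * np.random.rand(10000)
result = calculate_roots_sum.calculate_roots_sum(numbers)

# every element should be the same float32 sum of square roots
expected = np.sum(np.sqrt(numbers))
assert np.allclose(result, expected, rtol=1e-3)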
We can parallelise this loop by adding parallel() to the with nogil: line, and changing for i in range(0, num_vals) to for i in prange(0, num_vals) (note that the parallel and prange functions have to be imported from the cython.parallel module).
#cython: language_level=3
cimport cython
from libc.math cimport sqrt

import numpy as np
from cython.parallel import parallel, prange

@cython.boundscheck(False)
def calculate_roots_sum(numbers):
    cdef int num_vals = len(numbers)
    result = np.zeros(num_vals, "f")

    cdef double[::1] numbers_view = numbers
    cdef float[::1] result_view = result

    cdef int i = 0
    cdef int j = 0
    cdef float total = 0.0

    # prange shares the iterations of the outer loop across threads;
    # total is assigned inside the loop, so each thread gets its own copy
    with nogil, parallel():
        for i in prange(0, num_vals):
            total = 0.0
            for j in range(0, num_vals):
                total = total + sqrt(numbers_view[j])
            result_view[i] = total

    return result
Let us now retime this. Clear the Jupyter notebook kernel and re-run the import and timing code…
import pyximport
pyximport.install()
import calculate_roots_sum
import numpy as np
numbers = 500.0 * np.random.rand(10000)
%timeit calculate_roots_sum.calculate_roots_sum(numbers)
288 ms ± 4.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Parallelising the code has made it slower?!
This is because, for cython, a prange loop only actually runs in parallel if you pass compiler command line options that switch on OpenMP. The option is -fopenmp for gcc and clang on Linux and MacOS, or /openmp for the Microsoft compiler on Windows.
We've seen how you can add compiler command line options in a setup.py file. When compiling via pyximport, you can instead add them by creating a companion file for each of your .pyx files. This should have the same name as the .pyx file, but with the extension .pyxbld. The contents, copied below, can be identical for every .pyx file.
Create a file called calculate_roots_sum.pyxbld and copy in;
def make_ext(modname, pyxfilename):
    from distutils.extension import Extension
    ext = Extension(name=modname,
                    sources=[pyxfilename],
                    extra_compile_args=['-fopenmp'],
                    extra_link_args=['-fopenmp'])
    return ext
The key lines are
extra_compile_args=['-fopenmp'],
extra_link_args=['-fopenmp'])
where we add the -fopenmp compile flag. This switches on OpenMP.
Note that Windows compilers may need to use /openmp instead of -fopenmp
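If you want a single .pyxbld that works across platforms, one option (a sketch, not part of the course files) is to choose the flag based on sys.platform:
import sys

def make_ext(modname, pyxfilename):
    from distutils.extension import Extension

    # MSVC spells the OpenMP flag differently from gcc/clang,
    # and its linker does not take an OpenMP flag at all
    if sys.platform == "win32":
        compile_args = ["/openmp"]
        link_args = []
    else:
        compile_args = ["-fopenmp"]
        link_args = ["-fopenmp"]

    return Extension(name=modname,
                     sources=[pyxfilename],
                     extra_compile_args=compile_args,
                     extra_link_args=link_args)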
The -fopenmp flag is not supported by the default compiler on MacOS. To use it, you need to install another compiler, e.g. clang via homebrew, by typing brew install llvm. This should install clang somewhere like /opt/homebrew/Cellar/llvm/11.1.0/bin/clang (the exact path depends on your llvm version).
You then need to tell the Jupyter notebook to use this compiler by setting the CC environment variable, e.g. via
import os
os.environ["CC"] = "/opt/homebrew/Cellar/llvm/11.1.0/bin/clang"
(note that you should use the path to your installed clang)
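If you would rather not hard-code the version number, you can ask homebrew where it installed llvm (a sketch; this assumes brew is on your PATH):
import os
import subprocess

# ask homebrew for the llvm install prefix rather than hard-coding it
prefix = subprocess.check_output(["brew", "--prefix", "llvm"], text=True).strip()
os.environ["CC"] = os.path.join(prefix, "bin", "clang")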
We can now restart the Jupyter notebook kernel and rerun the code to import our module and run the timing (remembering to set the CC environment variable first if you are on MacOS)
import pyximport
pyximport.install()
import calculate_roots_sum
import numpy as np
numbers = 500.0 * np.random.rand(10000)
%timeit calculate_roots_sum.calculate_roots_sum(numbers)
55.4 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is about five times faster than the serial code, which is in line with what I would expect from my 4+4 core laptop.
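By default, OpenMP uses as many threads as there are cores. If you want to see how the speed scales with the number of threads, you can cap it via the OMP_NUM_THREADS environment variable (cython's parallel() and prange() also accept a num_threads argument). For example, to use four threads (an arbitrary choice for illustration), set the variable before importing the compiled module:
import os

# must be set before the OpenMP runtime starts up,
# i.e. before the compiled module is first imported
os.environ["OMP_NUM_THREADS"] = "4"

import pyximport
pyximport.install()
import calculate_roots_sum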
Edit your copy of cyslow.pyx to add in the parallel section (with parallel()) and also to switch to using a prange parallel range.
Next, edit your setup.py to include the -fopenmp option (or /openmp if you are compiling with the Microsoft compiler on Windows).
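One possible shape for the edited setup.py is sketched below (assuming your file uses the usual cythonize pattern; swap in /openmp for the Microsoft compiler):
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension("cyslow",
                sources=["cyslow.pyx"],
                extra_compile_args=["-fopenmp"],
                extra_link_args=["-fopenmp"])

setup(ext_modules=cythonize(ext))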
Compile your module again, using
python setup.py build_ext --inplace
Note, on MacOS, you will need to set the path to your clang compiler, e.g.
CC=/opt/homebrew/Cellar/llvm/11.1.0/bin/clang python setup.py build_ext --inplace
where you should use your own path to your clang compiler.
Import the cyslow module into your Jupyter notebook and use timeit to measure how long the calculate_scores function now takes for 100% of the data.
How does this compare to the serial cython code? Or to the serial or parallel numba-accelerated code?
Edit cyslow_main.py to load 100% of the data. Run this script and time it. How does this compare to the runtime of the serial cython code, or the runtime of the serial and parallel numba code?
(BONUS) Edit your cyslow.pyx to include a tqdm progress bar. Note that you will need to chunk the outer loop into blocks, as the progress bar can only be updated outside of the with nogil section; one possible shape is sketched below.
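Here is a sketch of the chunking idea, applied to calculate_roots_sum rather than your cyslow.pyx (the chunk size of 100 is an arbitrary choice):
#cython: language_level=3
cimport cython
from libc.math cimport sqrt

import numpy as np
from cython.parallel import parallel, prange
from tqdm import tqdm

@cython.boundscheck(False)
def calculate_roots_sum(numbers):
    cdef int num_vals = len(numbers)
    result = np.zeros(num_vals, "f")

    cdef double[::1] numbers_view = numbers
    cdef float[::1] result_view = result

    cdef int i = 0
    cdef int j = 0
    cdef int start = 0
    cdef int end = 0
    cdef int chunk_size = 100   # arbitrary block size - tune to taste
    cdef float total = 0.0

    # the progress bar is Python code, so it must live outside the
    # nogil block - we advance it once per chunk of the outer loop
    for start in tqdm(range(0, num_vals, chunk_size)):
        end = min(start + chunk_size, num_vals)

        with nogil, parallel():
            for i in prange(start, end):
                total = 0.0
                for j in range(0, num_vals):
                    total = total + sqrt(numbers_view[j])
                result_view[i] = total

    return result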