# Parallel programming with numba

It gets better! Because numba has compiled your code to machine code, it is not limited by the requirement of the Python Virtual Machine that the Global Interpreter Lock (GIL) is held while Python code is being executed. This means that the machine code can be parallelised to run over all of the cores of your computer, and is not limited to running on a single core.

You can tell numba to parallelise your code by adding parallel=True to the decorator, and replacing range with numba.prange (parallel range). For example;

@numba.jit(parallel=True)
def calculate_roots(numbers):
num_vals = len(numbers)
result = np.zeros(num_vals, "f")

for i in numba.prange(0, num_vals):
result[i] = math.sqrt(numbers[i])

return result

Lets time the function now;

timeit( calculate_roots(numbers) )

On my computer I get;

3.58 ms ± 66.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is only about 35% faster, even though my computer has 8 cores (and so could be up to 800% faster). Why isn’t this faster?

The reason is that ratio of computation to reading/writing to memory is very low for each loop. Here, we are have one square root calculation for every read from numbers, and for every write to result. This speed of this loop is thus likely to be limited by the maximum speed that the computer can read and write from memory. Adding more cores won’t speed it up much more.

# Parallelising a more complex loop

We can demonstrate this by making a slightly more complex loop.

@numba.jit()
def calculate_roots_sum(numbers):
num_vals = len(numbers)
result = np.zeros(num_vals, "f")

for i in range(0, num_vals):
total = 0.0

for j in range(0, num_vals):
total += math.sqrt(numbers[j])

result[i] = total

return result

This loop calculates the sum of the square roots of all numbers, repeated as many times as there are numbers (yes, this is a bit unnecessary…). In this case, we will have one write to memory (results[i]) for every num_vals square roots and reads from memory (numbers[j]).

Let’s test this with a smaller set of 10,000 random numbers.

numbers = 500.0 * np.random.rand(10000)
timeit(calculate_roots_sum(numbers))

On my computer I get;

94.3 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now we will add parallel=True and switch to numba.prange for the outer loop.

@numba.jit(parallel=True)
def calculate_roots_sum(numbers):
num_vals = len(numbers)
result = np.zeros(num_vals, "f")

for i in numba.prange(0, num_vals):
total = 0.0

for j in range(0, num_vals):
total += math.sqrt(numbers[j])

result[i] = total

return result
timeit(calculate_roots_sum(numbers))

On my computer I get;

16.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is over 500% faster, which is closer to what I would expect for my computer (4 fast cores plus 4 slow cores).

# Exercise

Use parallel=True and numba.prange to parallelise the calculate_scores function of the script.

• Using the timeit function, measure how long the function now takes to complete. How many times faster is the function compared to before you added the @numba.jit() decorator? And how many times faster is it than the serial numba function? Make this comparison both for processing 5% and 100% of the data. Does the parallel implementation take twenty times longer to process twenty times the amount of data?

• Now measure how long the total script takes to run to process 100% of the data, using, e.g. the time function on MacOS/Linux, or Measure-Command on Windows. How does the speed compare to the serial numba script to process 100% of the data? Can much more be gained by trying to optimise the calculate_scores function further?

Answer to the above exercise