It gets better! Because `numba` has compiled your code to machine code, it is not limited by the requirement of the Python Virtual Machine that the Global Interpreter Lock (GIL) is held while Python code is being executed. This means that the machine code can be parallelised to run over all of the cores of your computer, and is not limited to running on a single core.

You can tell `numba` to parallelise your code by adding `parallel=True` to the decorator, and replacing `range` with `numba.prange` (parallel range). For example;

```
@numba.jit(parallel=True)
def calculate_roots(numbers):
    num_vals = len(numbers)
    result = np.zeros(num_vals, "f")
    for i in numba.prange(0, num_vals):
        result[i] = math.sqrt(numbers[i])
    return result
```

Let’s time the function now;

`timeit( calculate_roots(numbers) )`

On my computer I get;

`3.58 ms ± 66.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`

This is only about 35% faster, even though my computer has 8 cores (and so could in principle be up to 8 times faster). Why isn’t this faster?

The reason is that the ratio of computation to reading from and writing to memory is very low for each loop iteration. Here, we have one square root calculation for every read from `numbers`, and for every write to `result`. The speed of this loop is thus likely to be limited by the maximum speed at which the computer can read from and write to memory. Adding more cores won’t speed it up much more.
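A rough way to see this is to count the bytes moved for every square root. The sketch below assumes that `numbers` holds 64-bit floats (NumPy's default) and relies on `result` being created with the `"f"` type code, i.e. 32-bit floats. The helper name `memory_traffic_mb` and the array size of one million are made up purely for illustration.

```python
READ_BYTES = 8   # assumed: numbers[i] is a 64-bit float
WRITE_BYTES = 4  # result[i] is a 32-bit float (type code "f")


def memory_traffic_mb(num_vals):
    """Megabytes read plus written by one pass of the simple loop."""
    return num_vals * (READ_BYTES + WRITE_BYTES) / 1e6


# Hypothetical array size, chosen only to illustrate the arithmetic.
print(memory_traffic_mb(1_000_000), "MB moved for 1,000,000 square roots")
```

Under these assumptions, every square root comes with 12 bytes of memory traffic, so extra cores are likely to spend much of their time waiting on the same memory.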

We can demonstrate this by making a slightly more complex loop.

```
@numba.jit()
def calculate_roots_sum(numbers):
    num_vals = len(numbers)
    result = np.zeros(num_vals, "f")
    for i in range(0, num_vals):
        total = 0.0
        for j in range(0, num_vals):
            total += math.sqrt(numbers[j])
        result[i] = total
    return result
```

This loop calculates the sum of the square roots of all numbers, repeated as many times as there are numbers (yes, this is a bit unnecessary…). In this case, we will have one write to memory (`result[i]`) for every `num_vals` square roots and reads from memory (`numbers[j]`).
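The difference between the two loops can be made concrete by counting operations. This is a bookkeeping sketch only: the function `loop_counts` is hypothetical, nothing is computed or timed, and the array length of 10,000 is just an example value.

```python
def loop_counts(num_vals, nested):
    """Count square roots, reads and writes for the two example loops."""
    if nested:
        # calculate_roots_sum: the inner loop runs num_vals times per outer step
        sqrts = reads = num_vals * num_vals
    else:
        # calculate_roots: one square root and one read per element
        sqrts = reads = num_vals
    writes = num_vals  # one write to result[i] per outer iteration
    return {"sqrts": sqrts, "reads": reads, "writes": writes}


for nested in (False, True):
    counts = loop_counts(10_000, nested)
    label = "nested loop" if nested else "simple loop"
    print(label, counts["sqrts"] // counts["writes"], "square roots per write")
```

The nested loop does `num_vals` times more computation for each value it writes, which is what gives the extra cores something to work on.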

Let’s test this with a smaller set of 10,000 random numbers.

`numbers = 500.0 * np.random.rand(10000)`

`timeit(calculate_roots_sum(numbers))`

On my computer I get;

`94.3 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)`

Now we will add `parallel=True` and switch to `numba.prange` for the outer loop.

```
@numba.jit(parallel=True)
def calculate_roots_sum(numbers):
    num_vals = len(numbers)
    result = np.zeros(num_vals, "f")
    for i in numba.prange(0, num_vals):
        total = 0.0
        for j in range(0, num_vals):
            total += math.sqrt(numbers[j])
        result[i] = total
    return result
```

`timeit(calculate_roots_sum(numbers))`

On my computer I get;

`16.9 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`

This is over 5 times faster, which is closer to what I would expect for my computer (4 fast cores plus 4 slow cores).
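For reference, the comparison can be worked out directly from the two timings quoted above. The helper names `speedup` and `parallel_efficiency` are made up for this sketch, and dividing by 8 treats all cores as equal, which is only an approximation on a machine with a mix of fast and slow cores.

```python
def speedup(serial_time, parallel_time):
    """How many times faster the parallel run is than the serial run."""
    return serial_time / parallel_time


def parallel_efficiency(serial_time, parallel_time, num_cores):
    """Fraction of the ideal num_cores-times speed-up actually achieved."""
    return speedup(serial_time, parallel_time) / num_cores


serial_ms = 94.3    # serial numba timing quoted above
parallel_ms = 16.9  # parallel numba timing quoted above

s = speedup(serial_ms, parallel_ms)
e = parallel_efficiency(serial_ms, parallel_ms, 8)
print(f"speed-up: {s:.2f}x, efficiency over 8 cores: {e:.0%}")
```

The same two functions can be reused with your own timings when comparing the serial and parallel versions of a function.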

Use `parallel=True` and `numba.prange` to parallelise the `calculate_scores` function of the script.

Using the `timeit` function, measure how long the function now takes to complete. How many times faster is the function compared to before you added the `@numba.jit()` decorator? And how many times faster is it than the serial numba function? Make this comparison both for processing 5% and 100% of the data. Does the parallel implementation take twenty times longer to process twenty times the amount of data?

Now measure how long the total script takes to run to process 100% of the data, using, e.g. the `time` command on macOS/Linux, or `Measure-Command` on Windows. How does the speed compare to the serial numba script to process 100% of the data? Can much more be gained by trying to optimise the `calculate_scores` function further?