import numba
import numpy as np

@numba.jit(cache=True, parallel=True)
def calculate_scores(data):
    """Calculate the score for each row. This is calculated
       as the sum of pairs of values on each row where the values
       are not equal to each other, and neither are equal to -1.

       Returns
       =======

       scores : numpy array containing the scores
    """
    nrows = data.shape[0]
    ncols = data.shape[1]

    # Here is the list of scores
    scores = np.zeros(nrows)

    # Loop over all rows, with the iterations shared between threads
    for irow in numba.prange(0, nrows):
        for i in range(0, ncols):
            for j in range(i, ncols):
                ival = data[irow, i]
                jval = data[irow, j]

                if ival != -1 and jval != -1 and ival != jval:
                    scores[irow] += 1

    return scores
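As a quick sanity check (not part of the original script), you can call the function on a tiny made-up array whose scores are easy to work out by hand, assuming calculate_scores is defined in your session (or call it as slow.calculate_scores):

import numpy as np

# Row 0 contains three distinct non-(-1) values (3, 5 and 7), giving 3 valid pairs;
# row 1 is all -1, so it contributes nothing
test_data = np.array([[3, 5, -1, 7],
                      [-1, -1, -1, -1]])

print(calculate_scores(test_data))   # expected: [3. 0.]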
import slow
(ids, varieties, data) = slow.load_and_parse_data(5)
scores = slow.calculate_scores(data)
timeit(slow.calculate_scores(data))
(ids, varieties, data) = slow.load_and_parse_data(100)
scores = slow.calculate_scores(data)
timeit(slow.calculate_scores(data))
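The timeit calls above rely on IPython, where timeit is available as a magic. If you are running plain Python instead, a rough equivalent using the standard-library timeit module looks like this (the loop count here is arbitrary):

import timeit
import slow

(ids, varieties, data) = slow.load_and_parse_data(5)

# Call the function once first so the numba compilation time is not included
slow.calculate_scores(data)

# Time 1,000 calls and report the mean time per call in microseconds
total = timeit.timeit(lambda: slow.calculate_scores(data), number=1000)
print(f"{1e6 * total / 1000:.1f} µs per call")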
On my laptop I get
739 µs ± 94.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
for processing 5% of the data, and
11.6 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
for processing 100% of the data.
time python slow.py
The best score 21931.0 comes from pattern MDC054122.001_13602
python slow.py 0.87s user 0.42s system 254% cpu 0.508 total
Parallelising the loop has sped up the function by over 4200 times compared to the pure Python version when processing 5% of the data, and it is over 4.8 times faster than the serial numba version. This is about what I would expect for my laptop (4 fast cores plus 4 slow cores). The relative speed-up is even larger when processing 100% of the data, where the parallel version runs 6.6 times faster than the serial version. With more data to process, the cost of spinning up and shutting down each thread is smaller relative to the cost of the calculation itself. This is also why processing 20 times the amount of data takes only ~16 times longer.
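If you want to see how the speed-up depends on the number of threads on your own machine, recent versions of numba let you query and change the thread count at runtime (by default it uses one thread per CPU core). A minimal sketch, assuming the slow module from above:

import numba
import slow

(ids, varieties, data) = slow.load_and_parse_data(100)

# How many threads numba will use by default
print(numba.get_num_threads())

# Restrict the parallel loop to 4 threads and time it again
numba.set_num_threads(4)
scores = slow.calculate_scores(data)
timeit(slow.calculate_scores(data))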
The parallel script is 77 ms faster than the serial script, which is roughly the difference between the serial and parallel timings of calculate_scores itself (about 65 ms). There is little to be gained for this data set by further optimising calculate_scores, as it now takes a very small proportion of the total runtime of the script (~12 ms out of ~500 ms). It would only be worth optimising this function if the amount of data to be processed increased significantly (by at least 10-100 times), because we have shown that the cost of calculate_scores grows roughly linearly with the amount of data to process (~16-20 times longer for 20 times the data).
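To see why further optimisation has so little headroom, consider the best case: even if calculate_scores took no time at all, the script could only get ~12 ms faster. A quick sketch of that bound, using the rough numbers above (the variable names are mine, not from the script):

# Rough numbers taken from the run above
total_time = 0.508     # total runtime of the parallel script, in seconds
scores_time = 0.0116   # time spent in calculate_scores, in seconds

# Even an infinitely fast calculate_scores only removes its own share of the runtime
best_possible_total = total_time - scores_time
max_overall_speedup = total_time / best_possible_total

print(f"calculate_scores is {100 * scores_time / total_time:.1f}% of the runtime")
print(f"best possible overall speed-up: {max_overall_speedup:.2f}x")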