Data Science Bowl 2017 – Space-Time tricks

Here we are again for the third post on my journey to Deep Learning.

Like all superheroes, the data scientist sometimes needs to call upon greater powers to solve the issue at hand.
Predicting lung cancer for the Data Science Bowl Competition by Booz Allen Hamilton and Kaggle is the perfect playground to learn such powers.

What powers? Today I am here to talk to you about how to manipulate space and time:
compressing the data, Just-In-Time compiling and vectorizing your code, and parallelizing your Data Science loops.
Bonus: running array operations on the GPU.

Today’s heroes are: Numpy, Bcolz, Zarr, Numba and Joblib

Space manipulation

So here you are, ready to challenge this great competition, and to pocket $500 000.
And you discover that you have to download a 70 GB .7z file.
And that, once uncompressed, it’s more than 150 GB.

And now you’re wondering how to store your preprocessed data after watershedding, connected component thresholding or Region of Interest generation.

Fear not, because you have 3 ways to store numpy arrays in a compressed manner.

First way – Pure Numpy

The first way is straightforward: pure NumPy and its numpy.savez_compressed function, which writes one or more arrays to a single compressed .npz file.
You can load the data back with numpy.load.
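As a minimal sketch of that round trip (the helper names save_npz and load_npz and the key "volume" are mine, for illustration only):

```python
import numpy as np


def save_npz(path, **arrays):
    """Write the named arrays to one zip-compressed .npz file."""
    np.savez_compressed(path, **arrays)


def load_npz(path):
    """Read every array stored in a .npz file back into a dict."""
    with np.load(path) as archive:
        return {name: archive[name] for name in archive.files}
```

You would call it as save_npz("patient.npz", volume=my_array) and get the array back with load_npz("patient.npz")["volume"].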

Second way – Bcolz

The second way is with bcolz.
Code says more than a hundred words.

Define a save_bcolz function

data_array should be a NumPy array, bcolz.cparams holds the compression parameters, and rootdir is the path on disk.
The data will be saved in a directory, not in a single compressed file.

Load the bcolz data :

mode='r' opens the data read-only, and [:] extracts the NumPy array from the bcolz container.

Third way – Zarr

Zarr is an alternative to Bcolz. If you’re familiar with HDF5, it strives to support similar features, like groups.
Besides saving data in a directory like bcolz, it can also save data in a single file.

Define a save_zarr function
In this example I will show how to use groups to save the data of the ~1600 patients.

id_patient is the name of the data I will use when I reload it later.

image is a NumPy ndarray (here a 3D array). Chunks define how the data is cut into blocks, which is a trade-off between storage space and decompression time.

After saving the data I suggest you change the permissions to read-only.
Zarr allows you to do on-disk computation, so it’s entirely possible to modify the on-disk data by mistake when manipulating zarr Arrays afterwards.

Load the zarr data :

[:] extracts the NumPy array from the zarr store.


Compression of the raw image is slightly better than .7z (and directly usable from Python)
Guido’s preprocessing output, compressed with bcolz, only takes 1.16 GB
Ankasor’s preprocessing output, compressed with zarr, only takes 0.76 GB

Crazy !

Time manipulation

Okkkaay, space is done.
While trying some kernels like Guido’s or Ankasor’s you probably realised that just to preprocess the data you would need a whole week. Oops.

And you also realize at one point that their preprocessing steps were only using 1 core of your multicore CPU, and no GPU even though you were manipulating images.

Okay let’s solve that.

First speed bump – Vectorize your code with Numba and Just-in-time compiling

By default, Python is an interpreted language, meaning that when you run Python code, it is executed line after line.
A lot of optimizations become possible if the code is analyzed and compiled right before being run (Just-In-Time compilation), vectorization being one of them.

The easiest way to do that for a data science project is to use Numba and the @autojit decorator.
Note: General Python code that doesn’t use NumPy has a lot of other options like PyPy or Cython.

How? By importing autojit and adding @autojit before the computational function you want to accelerate. (More recent Numba releases deprecated autojit: a plain @jit without a signature does the same job.)

Bonus: Use your GPU to go even faster

By importing cuda from numba, you can use the @cuda.jit decorator to run your code on GPU.
Check the documentation for what is supported.

Second speed bump – Run loop in parallel with joblib

Loops are an excellent opportunity to parallelize if there is no interdependence between iterations (no n = n+1).

The joblib library makes that quite easy.
Let’s say you have a list of images and want to apply a preprocessing function to all of them to get back a new list of preprocessed images.
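A minimal sketch of that pattern with joblib's Parallel and delayed. The preprocess function here is only a stand-in (it rescales an image to the [0, 1] range), and preprocess_all is a hypothetical wrapper name.

```python
import numpy as np
from joblib import Parallel, delayed


def preprocess(image):
    """Stand-in preprocessing step: rescale one image to the [0, 1] range."""
    image = image.astype(np.float32)
    low, high = image.min(), image.max()
    if high == low:
        return np.zeros_like(image)
    return (image - low) / (high - low)


def preprocess_all(images, n_jobs=-1):
    """Apply `preprocess` to every image, spread over `n_jobs` workers.

    n_jobs=-1 means "use all available cores". Results come back in the
    same order as the input list.
    """
    return Parallel(n_jobs=n_jobs)(delayed(preprocess)(img) for img in images)
```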

Note: if your preprocessing function does not return a value (for example because it saves its result to disk), you can call Parallel directly without assigning the result to anything.


Third speed bump – Delay preprocessing

Lastly, if you're really strapped for CPU, you can use computational graphs with dask.
Basically, when you do y = function(x), instead of computing y right away, dask will store the "computation graph".
Then, when you actually need the result, it will optimize the resources needed. It can even distribute the computation over multiple computers.
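A tiny sketch of that lazy behaviour with dask.delayed. The functions preprocess and build_graph are toy examples I made up, just enough to show that nothing runs until you ask for the result.

```python
import dask


def preprocess(x):
    """Toy stand-in for an expensive preprocessing step."""
    return x * 2


def build_graph(values):
    """Describe the work to do without running any of it."""
    lazy_results = [dask.delayed(preprocess)(v) for v in values]
    return dask.delayed(sum)(lazy_results)


graph = build_graph([1, 2, 3])  # only the graph exists at this point
total = graph.compute()         # the preprocessing calls actually run here
```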

I'll let you check the documentation.


That’s all folks,

Happy deep learning
