by Mamy Ratsimbazafy | February 12, 2017 9:24 pm
Here we are again for the third post on my journey to Deep Learning.
Like all super-heroes, the data scientist sometimes needs to call upon greater power to solve the issue at hand.
Predicting lung cancer for the Data Science Bowl Competition by Booz Allen Hamilton and Kaggle is the perfect playground to learn such power.
What powers? Today I am here to talk to you about how to manipulate space and time.
Compressing the data, Just-In-Time compiling and vectorizing your code, and parallelizing your Data Science loops.
Bonus: run array operations on the GPU.
Today’s heroes are: NumPy, Bcolz, Zarr, Numba and Joblib.
So here you are, ready to challenge this great competition, and to pocket $500 000.
And you discover that you have to download a 70GB .7z file.
And you discover that, uncompressed, it’s more than 150 GB.
And now you’re wondering how to store your preprocessed data after watershedding, connected component thresholding or Region of Interest generation.
Fear not, because you have 3 ways to store numpy arrays in a compressed manner.
The first way is straightforward: pure NumPy and the numpy.savez_compressed function.
You can load the data back with numpy.load.
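Here is a minimal round-trip sketch of that first way. The array contents, the key name `image` and the file name are made up for illustration; only numpy.savez_compressed and numpy.load come from the text above.

```python
import os
import tempfile

import numpy as np

# Hypothetical preprocessed scan: a mostly empty int16 volume with one bright cube
image = np.zeros((64, 64, 64), dtype='int16')
image[16:48, 16:48, 16:48] = 400

# Save under the (hypothetical) key "image" in a compressed .npz archive
path = os.path.join(tempfile.mkdtemp(), 'patient_0001.npz')
np.savez_compressed(path, image=image)

# numpy.load returns an archive object; index it with the key used when saving
with np.load(path) as archive:
    restored = archive['image']

print(image.nbytes, 'bytes in memory,', os.path.getsize(path), 'bytes on disk')
```

The archive is a single file, which is handy for sharing, but the whole array has to be decompressed to read any part of it.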
The second way is with bcolz.
Code says more than a hundred words.
Define a save_bcolz function
import bcolz
def save_bcolz(data_array, patient_id, outFolder):
    outFile = outFolder + patient_id + '.bcolz'
    z = bcolz.carray(
        data_array,  # the NumPy array to compress
        cparams=bcolz.cparams(clevel=9, cname="zstd", shuffle=2),  # "zstd" is the state-of-the-art compressor
        dtype='int16',  # if possible, save as integer for maximum compression
        rootdir=outFile)  # path of the on-disk directory
    z.flush()  # make sure the data is written to disk
data_array should be a NumPy array, bcolz.cparams holds the compression parameters, and rootdir is the path on disk.
The data will be saved in a directory, not in a single compressed file.
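To see why the integer dtype matters, here is a small sketch. It uses zlib from the standard library as a stand-in for the bcolz compressor (an assumption, so that the example runs without bcolz), and the synthetic scan values are made up.

```python
import zlib

import numpy as np

# Hypothetical scan: integer intensities in a Hounsfield-like range
rng = np.random.default_rng(0)
scan_int16 = rng.integers(-1000, 400, size=(32, 64, 64)).astype('int16')
scan_float64 = scan_int16.astype('float64')  # same values, 4x more bytes per voxel

# Compress the raw bytes of both versions with the same settings
size_int16 = len(zlib.compress(scan_int16.tobytes(), 9))
size_float64 = len(zlib.compress(scan_float64.tobytes(), 9))

print('raw bytes       :', scan_int16.nbytes, 'vs', scan_float64.nbytes)
print('compressed bytes:', size_int16, 'vs', size_float64)
```

The exact numbers depend on the compressor and the data, so treat this as an illustration of the trend rather than a benchmark of bcolz.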
Load the bcolz data :
def load_bcolz(patient, inFolder):
    # patient is expected to be the directory name as written by save_bcolz (with its '.bcolz' suffix)
    return bcolz.open(inFolder + patient, mode='r')[:]
mode='r' opens the data read-only, and [:] extracts the NumPy array from the bcolz array.
Zarr is an alternative to Bcolz. If you’re familiar with HDF5, it strives to support similar features, like groups.
Besides saving data in a directory like bcolz, it can also save data in a single file.
Define a save_zarr function
In this example I will show how to use groups to save the data of the ~1600 patients.
import zarr
# First set some global variables
# A store (DirectoryStore, ZipStore, MemoryStore) is where you wish to store the data: a directory on disk, a single file on disk (slower), or in memory (not persistent)
# Then one or more groups to save your data. It's like having a filesystem (folder/subfolder/data), check the [group documentation](https://zarr.readthedocs.io/en/latest/api/hierarchy.html).
ZARR_STORE_PREPROC = zarr.DirectoryStore('./data/compressed_preproc.zarr')
ZARR_GROUP_PREPROC = zarr.hierarchy.open_group(store=ZARR_STORE_PREPROC, mode='w-')
def save_zarr(id_patient, image):
    # Reconstructed call: Group.array creates a named, compressed array inside the group
    ZARR_GROUP_PREPROC.array(
        id_patient,
        image,
        chunks=(128, 128, 128),
        compressor=zarr.Blosc(clevel=9, cname="zstd", shuffle=2))
id_patient is the name of the data I will use when I reload it later.
image is a numpy ndarray (here 3D array). Chunks are how data is cut to optimize storage space and time to decompress.
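To get a feel for what the chunk shape implies, here is a back-of-the-envelope sketch in plain Python. The scan shape is hypothetical and chunk_grid is a helper written for this example, not a zarr function.

```python
import math

def chunk_grid(shape, chunks):
    """Number of chunks along each axis when `shape` is cut into blocks of `chunks`."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

shape = (300, 512, 512)    # hypothetical scan: 300 slices of 512 x 512 voxels
chunks = (128, 128, 128)   # chunk shape used in save_zarr

grid = chunk_grid(shape, chunks)       # (3, 4, 4)
n_chunks = math.prod(grid)             # 48 chunks in total
chunk_bytes = math.prod(chunks) * 2    # int16 = 2 bytes per voxel -> 4194304 bytes (4 MiB)

print(grid, n_chunks, chunk_bytes)
```

Each chunk is compressed on its own, so reading a small region only decompresses the chunks it touches. Bigger chunks usually compress better, smaller ones make partial reads cheaper.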
After saving the data I suggest you change the permission to read-only.
Zarr allows you to do on-disk computation so it’s very possible to modify on-disk data by mistake by manipulating zarr Arrays afterwards.
Load the zarr data :
# First global variables
# Check that you load data read-only with mode='r'
[:] extracts the NumPy array from the zarr array.
Compression of the raw image is slightly better than .7z (and directly usable from Python)
Compressing Guido’s preprocessing output with bcolz, data only takes 1.16 GB
Compressing Ankasor’s preprocessing output with zarr, data only takes 0.76 GB
Okkkaay, space is done.
While trying some kernels like Guido’s or Ankasor’s you probably realised that just to preprocess the data you would need a whole week. Oops.
And you also realize at one point that their preprocessing steps were only using 1 core of your multicore CPU, and no GPU even though you were manipulating images.
Okay let’s solve that.
By default, Python is an interpreted language, meaning that when you run Python code, it’s executed line after line.
If the code is analyzed and compiled before it runs, a lot of optimizations become possible, vectorization among them.
The easiest way to do that for a data science project is by using Numba and the @autojit decorator (later Numba releases dropped autojit; a plain @jit without a signature does the same job).
Note: General Python code that doesn’t use NumPy has a lot of other options like PyPy or Cython.
How? By importing autojit and adding @autojit before the computational function you want to accelerate.
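from numba import jit  # autojit was removed from later Numba releases; a bare @jit behaves the same way
@jit
def crop_center(img, cropx, cropy):  # function name and signature assumed from the variables used below
    y, x = img.shape
    startx = x//2 - (cropx//2)
    starty = y//2 - (cropy//2)
    return img[starty:starty+cropy, startx:startx+cropx]  # assumed return value: the centered crop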
Bonus: Use your GPU to go even faster
By importing cuda from numba, you can use the @cuda.jit decorator to run your code on GPU.
Check the documentation for what is supported.
Loops are an excellent opportunity to parallelize if there is no interdependence between loop runs (no n = n+1).
The joblib library makes that quite easy.
Let’s say you have a list of images, and you want to apply a preprocessing function to all of them and get back a new list of preprocessed images.
from joblib import Parallel, delayed
# preproc_function is a function that accepts a single image as argument and returns an image
images = Parallel(n_jobs=-1)(delayed(preproc_function)(image) for image in images)
Note: if your preprocessing function does not return a value, you can simply call:
Parallel(n_jobs=-1)(delayed(preproc_function)(image) for image in images)
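Here is a self-contained sketch of the same pattern that you can run as is. The preprocessing function and the random images are made up for illustration, and prefer="threads" is my own choice to keep the sketch lightweight; drop it to get joblib's default process-based backend as in the snippet above.

```python
import numpy as np
from joblib import Parallel, delayed

def preproc_function(image):
    """Hypothetical preprocessing step: clip to a fixed range and rescale to [0, 1]."""
    clipped = np.clip(image, -1000, 400)
    return (clipped + 1000) / 1400.0

# Hypothetical input: 8 random 64 x 64 "images"
rng = np.random.default_rng(0)
images = [rng.integers(-2000, 2000, size=(64, 64)) for _ in range(8)]

# Plain loop, for comparison
sequential = [preproc_function(image) for image in images]

# Same work spread over 2 workers; results come back in input order
parallel = Parallel(n_jobs=2, prefer="threads")(
    delayed(preproc_function)(image) for image in images
)
```

Since each image is processed independently, the parallel list matches the sequential one element by element.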
Lastly, if you're really starved for CPU, you can use computational graphs with dask.
Basically, when you do y = function(x), instead of computing y right away, dask will store the "computation graph".
Then, when you actually need the result, it will optimize the resources needed. It can even distribute the computation over multiple computers.
I'll let you check the documentation.
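To give a taste, here is a tiny sketch with dask.delayed. The two functions are toy placeholders invented for this example.

```python
from dask import delayed

def load(n):
    """Toy stand-in for an expensive loading step."""
    return list(range(n))

def total(values):
    """Toy stand-in for an expensive computation."""
    return sum(values)

# Wrapping the calls only records them in a graph; nothing runs yet
lazy_values = delayed(load)(5)
lazy_total = delayed(total)(lazy_values)

# compute() walks the graph and runs the functions
result = lazy_total.compute()  # 0 + 1 + 2 + 3 + 4 = 10
```

Until compute() is called, lazy_total is just a description of the work to do, which is what lets dask decide how and where to run it.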
That’s all folks,
Happy deep learning
Source URL: https://andre-ratsimbazafy.com/data-science-bowl-2017-space-time-tricks/
Copyright ©2017 Marie & Mamy's Insights unless otherwise noted.