Journey to Deep Learning #2: Don’t fight the wrong fights

2 months ago, I gathered my courage and threw myself at Machine Learning.
Machine learning, at least the supervised kind, is the art of training
a black box (a young panda) by telling it what you expect.
Once it’s properly trained, you feed it unseen data (a new meal)
and it shows you what to expect.

Okay, enough with this silly analogy. 2 weeks ago, I bought a graphics card (GPU) to dive deeper into Machine Learning and start … Deep Learning!
By a stroke of luck, that same day Booz Allen Hamilton launched the Data Science Bowl, a competition to detect lung cancer with $1 million in prizes.
Wow, so now I can do data for good, have fun, and dream about getting paid for it, sign me up!

Now you will ask me, how come you have time for this blog post? Well … I’ve been staring at this screen since Saturday 3AM:

Learning deep waiting
There are about 1600 patients and my computer is processing 30 per hour …
I think I found a way to run that on my GPU and cut the time by a factor of at least 20,
except that I don’t dare interrupt the panda … computation.
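For the curious, here is the back-of-the-envelope arithmetic behind that wait, as a small sketch. The patient count and the rate are the figures quoted above; the 20x speedup is only my hope, not a measurement.

```python
# Back-of-the-envelope estimate using the figures quoted above.
patients = 1600          # approximate number of patients in the dataset
rate_per_hour = 30       # patients processed per hour on the CPU
speedup = 20             # hoped-for GPU speedup (a guess, not a measurement)

cpu_hours = patients / rate_per_hour
gpu_hours = cpu_hours / speedup

print(f"CPU: {cpu_hours:.1f} h (about {cpu_hours / 24:.1f} days)")
print(f"GPU: {gpu_hours:.1f} h if the {speedup}x speedup holds")
```

So roughly two days of waiting on the CPU, against an afternoon on the GPU if the speedup turns out to be real.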

So since I suddenly have some free time, here are some tips for you aspiring data scientists.
Don’t fight the wrong fights; you will already have plenty with:

  • Preprocessing your data
  • Feature extraction
  • Feature engineering (check this loooong article)
  • Feature selection
  • Model(s) selection
  • Optimizing hyperparameters
  • Model stacking

And last but not least, if you are in a company, putting all of that in production.

1. Hardware selection

  • I hope you have 16GB of RAM
  • I hope you bought an Nvidia GPU with at least 6GB of RAM
  • I hope you have 220GB of free disk space
  • I heard you wanted a quiet life.

There is something called “The curse of dimensionality”.
Machine learning is not Big Data, but it’s often too much data.
For example, the Data Science Bowl dataset is about 1600 3D images, 150 GB in total, delivered as a 70 GB archive. Oops.
Now if you do deep learning on your GPU, the GPU memory must be able to hold:

  • at least 1 image (1 GB)
  • its neural network a.k.a. a black box with typically millions of parameters to optimize.
  • the forward pass (the data transformation pass)
  • the backpropagation pass (tuning of parameters)
  • the display sent to your monitor. (hint: use a dedicated Deep Learning GPU)
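To get a feel for how quickly that list eats a 6 GB card, here is a rough sketch of the budget. Everything except the 1 GB image is an assumption for illustration: float32 values (4 bytes each), plain SGD (one gradient per parameter), a hypothetical 10 million parameter network, and a made-up 3 GiB for the intermediate results kept between the forward and backpropagation passes.

```python
GIB = 1024 ** 3

def training_memory_bytes(n_params, input_bytes, activation_bytes, bytes_per_value=4):
    """Rough lower bound on GPU memory needed for one training step.

    Assumes float32 values and plain SGD: one weight and one gradient
    per parameter. Optimizers such as Adam keep extra state per parameter.
    """
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    return input_bytes + weights + gradients + activation_bytes

needed = training_memory_bytes(
    n_params=10_000_000,        # hypothetical network size
    input_bytes=1 * GIB,        # one 3D image, as quoted above
    activation_bytes=3 * GIB,   # hypothetical: intermediate results kept for backpropagation
)
print(f"Estimated need: {needed / GIB:.2f} GiB of a 6 GiB card")
```

Note that in this sketch the parameters themselves are the cheap part; the image and the intermediate results dominate, and the display driver still wants its share.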

Side note about the quiet life: deep learning training takes days, with your computer running at full speed.
Would you like to sleep and wake up to the sound of a vacuum cleaner? Buy silent parts!

Speaking of cleaning, I almost fried my computer by putting the CPU fan back the wrong way after cleaning it.
I’ll do a blog post on my hardware later.

2. OS to use (Windows, Linux, Microsoft Azure, Amazon Web Services ?)

  • Use whatever you are familiar with. Anaconda has most of the data scientist’s tools in one neat package.

Spend your time learning data science first.
Here is a shameless plug for my tutorial on how to pass your GPU through to a Linux LXC container.
Normally you can apply it to Docker too.

3. Choosing programming language and library

  • There are 2 kings in data science: R and Python, both have a thriving ecosystem.

Use a high-level library that wraps a low-level one.
R has H2O (great guys!), mxnet and deepnet.
Python has Keras (what I use), mxnet, Lasagne, Nolearn, Nervana Neon.
“Low-level” libraries include TensorFlow and Theano.

Challengers are C++ (Caffe), Julia (Mocha) and Lua (Torch).
Their main issue is not deep learning but the ecosystem to preprocess the data.

4. The mono-tasking computer

  • Deep Learning is very jealous. Want to use your computer for something else? Enjoy the sluggishness!

5. Fighting the libraries

Now we’re talking: you’ve graduated from Deep Learning Initiate, you’ve run a few models
and you realise that even the preprocessing step takes days (like me yesterday).

So you engage in a quest to:

- Compile libraries from source to get speedups such as CUDA computing on the GPU, AVX2 vectorization support and whatever else ...

If you’re on Arch Linux, rejoice, I’ve got you covered: here is my GitHub repo.

6. Overfitting

Overfitting is like training a panda to put a triangle in a triangle hole.
Then you give him a square hole, a triangle and a square, and he is stumped.
Oops, so instead of a smart network that generalizes you have a dumb one, and you’re back to the drawing board.
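Here is the panda story as a toy sketch. It is not a real network: the “model” is just a lookup table that memorizes what it was shown, which is overfitting pushed to its extreme. All names and data are made up for illustration.

```python
# Training data: the panda only ever sees triangles going into triangle holes.
training = [("triangle", "triangle hole")] * 5

# A "model" that memorizes: it maps each shape it has seen to the hole it was shown.
memorized = dict(training)

def predict(shape):
    # Returns None for any shape absent from the training data.
    return memorized.get(shape)

train_accuracy = sum(predict(s) == hole for s, hole in training) / len(training)

unseen = [("square", "square hole")]
unseen_accuracy = sum(predict(s) == hole for s, hole in unseen) / len(unseen)

print(f"training accuracy: {train_accuracy:.0%}, unseen accuracy: {unseen_accuracy:.0%}")
```

A perfect score on the training data and nothing on new data: that gap is the symptom to watch for.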

7. Contaminating your cross-validation set.

Cross-validation is how you check for overfitting: you hold back some data during the panda training stage
and at the end you give him some triangles and squares to make sure he behaves.

Now let’s say the panda is supposed to give predictions based on the average number of squares he saw.
Too often, people use the mean of the whole set and forget to exclude the “hidden” cross-validation set from this computation.
That’s contaminated, get that away from me.
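A minimal sketch of the difference, with made-up numbers and a hypothetical “number of squares” feature. The point is only where the mean is computed: on the training part alone, never on the whole set.

```python
from statistics import mean

# Hypothetical feature: number of squares the panda saw per sample.
squares = [2, 4, 6, 8, 100, 120]

train = squares[:4]      # data the panda is allowed to learn from
held_out = squares[4:]   # the "hidden" cross-validation set

contaminated_mean = mean(squares)  # wrong: peeks at the held-out samples
clean_mean = mean(train)           # right: computed on training data only

# Center both sets with the statistic learned from the training data alone.
train_centered = [x - clean_mean for x in train]
held_out_centered = [x - clean_mean for x in held_out]

print("contaminated mean:", contaminated_mean)
print("clean mean:", clean_mean)
```

The same rule applies to any statistic you learn from the data (scaling factors, vocabularies, selected features): compute it on the training part, then apply it unchanged to the held-out part.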

8. Not saving your model and results

Enjoy your panda tripping over the power cord much?
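A minimal sketch of panda-proofing, using only the standard library. The file name, the `save_checkpoint` and `load_checkpoint` helpers and the fake “weights” are all hypothetical; a real library will have its own way of saving a model, but the idea is the same: save after every epoch, and write to a temporary file first so a power cut never leaves a half-written checkpoint.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)

# Hypothetical location for the checkpoint file.
checkpoint_path = os.path.join(tempfile.gettempdir(), "panda_checkpoint.json")

for epoch in range(3):
    # ... one epoch of training would go here ...
    save_checkpoint(checkpoint_path, {"epoch": epoch, "weights": [0.1 * epoch, 0.2]})

print(load_checkpoint(checkpoint_path))
```

After a crash, `load_checkpoint` tells you which epoch to resume from instead of starting the whole multi-day run again.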

9. Not using version control

“Ah, I did that, result was great, what did I do again, mmmh was it a learning rate of 0.014 or 0.017?”
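One cheap habit that goes well with version control, sketched here under assumptions: the `run_record` helper, the `runs.jsonl` file name and the score are all made up for illustration. Each experiment appends its settings and its result to a small text file that gets committed together with the code.

```python
import hashlib
import json

def run_record(hyperparams, score):
    """Bundle the settings and the result of one experiment into a single record."""
    # Sorting the keys makes the identifier independent of dictionary order.
    canonical = json.dumps(hyperparams, sort_keys=True)
    run_id = hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]
    return {"run_id": run_id, "hyperparams": hyperparams, "score": score}

# Made-up numbers, for illustration only.
record = run_record({"learning_rate": 0.014, "epochs": 20}, score=0.91)

# Hypothetical log file, one JSON record per line, committed with the code.
with open("runs.jsonl", "a") as log:
    log.write(json.dumps(record, sort_keys=True) + "\n")

print(record["run_id"], record["hyperparams"])
```

Next time the question is “0.014 or 0.017?”, the answer is one search in the log away.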

10. Not having time

Take it or leave it.

Bonus: fighting WordPress for 3 hours to publish a properly formatted post and losing … (Sorry … I tried)

It’s Word all over again ….

Enjoyed the read? Leave a comment or share the article, much appreciated !

By the way, save the data scientists, save the pandas !
