Journey to Deep Learning #2: Don’t fight the wrong fights

by Mamy Ratsimbazafy | January 30, 2017 11:52 pm

2 months ago, I took my courage in both hands and threw it at Machine Learning.
Machine Learning, at least supervised learning, is the art of training
a black box (a young panda) by telling it what you expect.
Once it's properly trained, feed it unseen data (a new meal)
and it will show you what to expect.

Okay, enough with this silly analogy. 2 weeks ago, I bought a graphics card (GPU) to dive deeper into Machine Learning and start … Deep Learning!
By a stroke of luck, the same day Booz Allen Hamilton launched the Data Science Bowl[1], a competition to detect lung cancer with $1 million in prizes.
Wow, so now I can do data for good, have fun, and dream about getting paid for it. Sign me up!

Now you will ask me: how come you have time for this blog post? Well … I've been staring at this screen since Saturday, 3 AM:

[Screenshot: deep learning … waiting]
There are about 1,600 patients and the computer is processing 30 per hour …
I think I found a way to run that computation on my GPU and cut the time by a factor of at least 20,
except that I don't dare interrupt the panda …. computation.

So since I suddenly have some free time, here are some tips for you, aspiring Data Scientist.
Don't fight the wrong fights; you will have plenty with:

- hardware selection
- the choice of OS
- the choice of programming language and library
- the mono-tasking computer
- the libraries themselves
- overfitting
- contaminating your cross-validation set
- not saving your model and results
- not using version control
- not having time

And last but not least, if you are in a company: putting all of that in production.

1. Hardware selection

There is something called “The curse of dimensionality”.
Machine learning is not Big Data, but it is often too much data.
For example, the Data Science Bowl dataset is about 1,600 3D images, 150 GB in total, delivered as a 70 GB archive. Oops.
Now if you do deep learning on your GPU, the GPU must be able to load both the model and your data batches into its memory.
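
To get a feel for the numbers, here is a rough back-of-the-envelope estimate in Python. The scan shape and batch size are assumptions for the sake of illustration, not the actual dataset dimensions.

    # Rough memory estimate for one batch of 3D scans (assumed shape, float32 voxels).
    slices, height, width = 200, 512, 512    # assumed CT scan dimensions
    bytes_per_voxel = 4                      # float32
    batch_size = 8

    batch_gb = batch_size * slices * height * width * bytes_per_voxel / 1024.0**3
    print("One raw batch is roughly %.1f GB before any downsampling" % batch_gb)

And that is before counting the network's weights and activations, which is why resizing and cropping the scans is usually the very first preprocessing step.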

A side note about the quiet life: deep learning training takes days with your computer running at full speed.
Would you like to go to sleep and wake up to the sound of a vacuum cleaner? Buy silent parts!

Speaking of cleaning, I almost fried mine by cleaning it and putting the CPU fan back the wrong way.
I’ll do a blog post on my hardware later.

2. OS to use (Windows, Linux, Microsoft Azure, Amazon Web Services?)

Don't agonize over this one: at first, spend your time learning data science, not fighting your OS.
Here is a shameless plug for my tutorial on how to pass your GPU through to a Linux LXC container[3].
You can normally apply it to Docker too.
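
Once the passthrough works, a quick sanity check from inside the container saves a lot of head-scratching. A minimal sketch, assuming a TensorFlow backend is installed (adapt the device query to Theano or mxnet if that is your stack):

    # List the devices visible from inside the container (assumes TensorFlow is installed).
    from tensorflow.python.client import device_lib

    devices = device_lib.list_local_devices()
    gpus = [d.name for d in devices if d.device_type == 'GPU']
    print("GPUs visible: %s" % (gpus if gpus else "none, check your passthrough"))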

3. Choosing programming language and library

Use a high-level library that wraps the low-level ones.
R has H2O (great guys!), mxnet and deepnet.
Python has Keras (what I use), mxnet, Lasagne, Nolearn and Nervana Neon.
“Low-level” libraries include TensorFlow and Theano.

Challengers are C++ (Caffe), Julia (Mocha) and Lua (Torch).
Their main issue is not the deep learning part but the surrounding ecosystem for preprocessing the data.
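
To see how little boilerplate the high-level route needs, here is a minimal Keras sketch for a toy binary classifier. The layer sizes and the 100-feature input are made up for the example, not tied to any real dataset.

    # A tiny Keras model: a few lines on top of a Theano or TensorFlow backend.
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(100,)))  # 100 input features (arbitrary)
    model.add(Dropout(0.5))                                      # some regularization for free
    model.add(Dense(1, activation='sigmoid'))                    # binary output, e.g. cancer / no cancer
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()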

4. The mono-tasking computer

While a long training or preprocessing run is going (remember the screenshot above), you don't dare touch the machine: it is busy, and it stays busy for days.

5. Fighting the libraries

Now we're talking: you've graduated from Deep Learning Initiate, you've run a few models,
and you realise that even the preprocessing step takes days (like me yesterday).

So you engage in a quest to:

- compile libraries from source to get speedups like CUDA computation on the GPU, AVX2 vectorization support and who knows what else …

If you're on Arch Linux, rejoice, I've got you covered: here is my GitHub repo[4].
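
Once everything is compiled, check that your builds actually picked up the fast code paths. A quick sketch, assuming numpy and Theano are part of what you rebuilt (adapt to your own stack):

    # Verify which BLAS numpy was linked against and which device Theano will use.
    import numpy
    numpy.show_config()             # look for openblas or mkl rather than the reference BLAS

    import theano
    print(theano.config.device)     # 'gpu' or 'cuda*' means CUDA is actually in use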

6. Overfitting

Overfitting is like training a panda to put a triangle in a triangle hole.
Then you give him a square hole, a triangle and a square, and he is stumped.
Oops: instead of a smart network that generalizes, you have a dumb one, and you're back to the drawing board.
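
The usual counter-measures (a hold-out validation set, early stopping, dropout) are one-liners in a high-level library. A sketch with a Keras callback, where `model`, `X_train` and `y_train` are placeholders for your own pipeline:

    # Watch the validation loss and stop before the network starts memorizing the training set.
    from keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='val_loss', patience=3)   # stop after 3 epochs without improvement
    model.fit(X_train, y_train,                                  # placeholders from your own pipeline
              validation_split=0.2,      # hold back 20% of the data to detect overfitting
              nb_epoch=50,               # named 'epochs' in later Keras versions
              callbacks=[early_stop])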

7. Contaminating your cross-validation set

Cross-validation is how you check for overfitting: you hold back some data during the panda training stage,
and at the end you give him some triangles and squares to make sure he behaves.

Now let's say the panda is supposed to give predictions based on the average number of squares he has seen.
Too often, people compute that mean over the whole dataset and forget to exclude the “hidden” cross-validation set from the computation.
That's contaminated; keep it away from me.
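
In code, the rule is simple: compute any statistic (means, scalers, encoders) on the training split only, then reuse it on the hold-out split. A minimal sketch with scikit-learn, where X and y are placeholder arrays:

    # Fit preprocessing statistics on the training split only, then apply them to the hold-out set.
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

    scaler = StandardScaler().fit(X_train)   # mean and std come from the training data only
    X_train = scaler.transform(X_train)
    X_valid = scaler.transform(X_valid)      # the hold-out set never influences the statistics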

8. Not saving your model and results

Enjoy your panda tripping over the power cord much?
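
Keras makes this painless: checkpoint the best weights during training and save the full model at the end, so a tripped power cord only costs you the current run. The file names below are just examples, and saving to HDF5 needs the h5py package.

    # Checkpoint the best weights during training and save the whole model afterwards.
    from keras.callbacks import ModelCheckpoint

    checkpoint = ModelCheckpoint('weights_best.h5', monitor='val_loss', save_best_only=True)
    model.fit(X_train, y_train, validation_split=0.2, callbacks=[checkpoint])   # placeholders again

    model.save('panda_final.h5')   # architecture + weights + optimizer state
    # later: from keras.models import load_model; model = load_model('panda_final.h5')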

9. Not using version control

“Ah, I did that, result was great, what did I do again, mmmh was it a learning rate of 0.014 or 0.017?”

10. Not having time

Take it or leave it.

Bonus: fighting WordPress for 3 hours to publish a properly formatted post and losing … (Sorry … I tried)

It’s Word all over again ….

Enjoyed the read? Leave a comment or share the article, much appreciated!

By the way, save the data scientists, save the pandas!

Endnotes:
  1. Data Science Bowl: http://www.datasciencebowl.com/
  2. article: http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
  3. how to passthrough your GPU to a linux LXC container: https://andre-ratsimbazafy.com/cuda-gpu-passthrough-to-a-lxc-container/
  4. github repo: https://github.com/mratsim/Arch-Machine-Learning

Source URL: https://andre-ratsimbazafy.com/journey-deep-learning-2-dont-fight-wrong-fights/