High performance tensor library in Nim

by Mamy Ratsimbazafy | October 15, 2017 3:25 pm

Toward a (smoking!) high performance tensor library in Nim

Forewords

In April I started Arraymancer[1], yet another tensor library™, in yet another language™, Nim[2].

Why, you ask?

I wanted to have fun. When I’m bored, I like to learn new things, research them “in-depth” and build something with them.

It’s a rabbit hole, really: before Nim came a GPU virtualized in LXC containers[3], programming and number theory, in Haskell[4], and a go-playing bot using Monte-Carlo[5]…

In January I geared up for Data Science Bowl #3[6], a one-million-dollar competition to predict lung cancer sponsored by the US National Cancer Institute, downloaded 160GB of 3D lung scans and …

Damn, Python/Numpy/Scipy was slow: I spent more time preprocessing the images with watershed segmentation (a full night) than training my neural network (3 hours).

[Image: lung segmentation. Courtesy of Ashish Shah[7], Guido Zuidhof[8] and Arnav Jain[9]]

That meant I couldn’t experiment as fast as I wished. So I stopped there and set out to learn more about neural networks and computer vision behind the scenes. I watched Stanford’s CS231n courses[10] and did the Hacker’s guide to neural network[11] tutorial by Andrej Karpathy.

The tutorial was in Javascript, but I wanted to use a language I had stumbled upon after my frustrations with Rust[12]: Nim. So here I go.

Until I reached the part where the guide makes you differentiate every gate by hand.
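In Nim, the kind of code the guide wants you to write looks something like this (a sketch of the classic f(x, y, z) = (x + y) * z example, adapted from the guide’s Javascript):

    # Forward pass for f(x, y, z) = (x + y) * z
    let x = -2.0
    let y = 5.0
    let z = -4.0
    let q = x + y              # q = 3.0
    let f = q * z              # f = -12.0

    # Backward pass: the chain rule applied by hand, gate by gate
    let dfdz = q               # d(q*z)/dz = q
    let dfdq = z               # d(q*z)/dq = z
    let dfdx = 1.0 * dfdq      # dq/dx = 1, then chain rule
    let dfdy = 1.0 * dfdq      # dq/dy = 1, then chain rule

    echo f, " ", dfdx, " ", dfdy, " ", dfdz   # -12.0 -4.0 -4.0 3.0

Now imagine doing that for every gate of a whole neural network.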

No way I’m coding that, so I wrote an autograd to automatically differentiate functions, available here[13].

Then I wanted to generalize this autograd to vectors, matrices and tensors, 4D tensors especially, so I could pass N images, C color channels (RGB), H height and W width.

The Nim ecosystem being young (Nim is 8 years old, Python 27 years old), nothing suited my needs, so I started my own library with two goals: great ergonomics and blazing fast performance.

Table of Contents

Forewords[15]
Table of Contents[16]
Why Nim?[17]
  Ergonomics[18]
    Cognitive overload[19]
    Generics[20]
    Flexible static-typing[21]
    Operator overloading and infix operators[22]
    Functional programming facilities[23]
    Metaprogramming: the boilerplate squashing tool[24]
    Syntax for humans[25]
  Performance[26]
    What’s the result?[27]
    How? Low-level memory management in Nim[28]
    How? Data parallelism in Nim[29]
    How? Controlling overhead[30]
  Portability[31]
    Compiles to C, C++, Objective-C and Javascript[32]
    Calling C, C++, Obj-C, javascript libraries is easy[33]
Arraymancer[34]
The end[35]

Why Nim?

Ergonomics

Cognitive overload

Building something fast but too hard to use is a tragedy. Scientific computing is an exercise in fast experimentation and iteration, and the programming language should stay in the shadows so that scientists can focus on their algorithms.

That’s an obvious fail for Rust: the borrow checker brings too much cognitive overload.

Generics

Furthermore, maths and science are all about generalizing and extracting overarching concepts out of things.

I don’t want to write this kind of thing in 2017: in Go, for example, the gonum library[36] has to hand-write a separate float32 version of every math function, because the language has no generics. Just give me generics like in C++, Rust, Haskell …
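In Nim, the equivalent is a one-liner. A sketch (square is just an illustrative name):

    proc square[T: int32|int64|float32|float64](x: T): T =
      # Only these four types are accepted; anything else
      # is rejected at compile time.
      x * x

    echo square(3'i32)   # 9
    echo square(1.5)     # 2.25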

Here I just tell the compiler that only signed integer (int32, int64) and floating point (float32, float64) inputs are valid.

Flexible static-typing

Static typing is often seen as an obstacle to fast iteration in science. This does not fly: for Python numerical libraries to be robust, most of their functions must be buried deep in input checks that verify at runtime what a type system would have checked at compile time.

Furthermore, if physicists had known about the following feature, we might have avoided airplane and space shuttle crashes[37].
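A minimal sketch with Nim’s distinct types (Meters and Meters2 are illustrative names):

    type
      Meters = distinct float
      Meters2 = distinct float   # an area, in square meters

    # Every allowed operation must be declared explicitly.
    proc `+`(a, b: Meters): Meters {.borrow.}   # reuse float's `+`
    proc `*`(a, b: Meters): Meters2 =           # length * length = area
      Meters2(float(a) * float(b))

    let width  = Meters(10.0)
    let height = Meters(5.0)
    echo float(width + height)   # 15.0, still Meters
    echo float(width * height)   # 50.0, a Meters2
    # echo width + 5.0           # rejected: a plain float is not Meters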

Note: Nim provides tools to facilitate implicit conversion when you do want it.

You probably noticed that you have to declare which operators are valid, like + and *. This is actually a feature, because multiplying Meters should give Meters2, not Meters.

How awesome is that? Note: this is also useful in finance, to avoid adding two different currencies, or British pounds to pennies.

Operator overloading and infix operators

In the previous section you saw * and +. Nim allows you to reuse the same operators to mean different things depending on the input: * between two float64 gives a float64, but between Meters it gives Meters2, between a matrix and a vector it gives a vector, etc.

Also, Nim allows you to create your own operators by combining any number of these symbols: = + - * / < > @ $ ~ & % | ! ? ^ . :

So I could create a proc * to mean matrix multiplication and .* to mean elementwise multiplication, like in Julia and Matlab. I could even use <*> to mean Haskell’s applicative map, but let’s not go there.
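For instance, a hypothetical element-wise .* on seqs could look like this (a sketch, not Arraymancer’s actual implementation):

    proc `.*`(a, b: seq[float]): seq[float] =
      # Element-wise product; both sequences must have the same length.
      doAssert a.len == b.len
      result = newSeq[float](a.len)
      for i in 0 ..< a.len:
        result[i] = a[i] * b[i]

    echo(@[1.0, 2.0, 3.0] .* @[4.0, 5.0, 6.0])   # @[4.0, 10.0, 18.0]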

Functional programming facilities

Functions are first-class citizens in Nim and can be passed to other functions. I let my solution to Project Euler[38] problem #5 speak for me.
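The idea, as a sketch: the answer is the least common multiple of all numbers in 1..20, obtained by folding lcm over the range, passing the function around like any other value.

    import math, sequtils

    # Smallest positive number evenly divisible by all of 1..20:
    # fold the least common multiple over the whole range.
    proc lcm2(a, b: int): int = a * b div gcd(a, b)

    echo toSeq(1..20).foldl(lcm2(a, b))   # 232792560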

Metaprogramming: the boilerplate squashing tool

I’m fond of DRY (Don’t Repeat Yourself). Object-Oriented Programming, getters, setters, etc. are not for me; Java especially is awfully guilty of inflating the number of lines of code to the extreme.

I like my programs lean and fast.

So let’s say I want to implement “universal functions” in my library. In Numpy jargon, that means a function that applies element-wise to all elements of the multidimensional array, like sin, cos, exp, sqrt, ln/log.

Ingredient: a generic map that applies any function to every element of a tensor.

How boring would it be to use that directly?
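One near-identical proc per function, something like this (a sketch assuming Arraymancer’s Tensor and its map):

    import math
    import arraymancer

    proc sin[T](t: Tensor[T]): Tensor[T] = t.map(math.sin)
    proc cos[T](t: Tensor[T]): Tensor[T] = t.map(math.cos)
    proc exp[T](t: Tensor[T]): Tensor[T] = t.map(math.exp)
    proc sqrt[T](t: Tensor[T]): Tensor[T] = t.map(math.sqrt)
    # ... and so on, for every single universal function.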

Instead you can use a template that generates the boilerplate for you.
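Arraymancer’s makeUniversal does roughly this (simplified sketch):

    template makeUniversal(fname: untyped) =
      # Generate a proc that maps the scalar `fname` over a whole tensor.
      proc fname*[T](t: Tensor[T]): Tensor[T] = t.map(fname)

    makeUniversal(sin)
    makeUniversal(cos)
    makeUniversal(exp)
    makeUniversal(sqrt)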

You might say that I only saved one line per new function: useless. Actually this is EXTREMELY useful for wrapping Cuda kernels, where each elementwise operation needs its own kernel declaration and launch boilerplate that a similar template can generate. Suddenly it’s 10 lines saved per operation. Convinced?

Ok, last try. Thanks to Nim metaprogramming I could implement this very nice slicing syntax, by just bending the language to my will, like Neo in the Matrix.
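It looks like this (examples in the spirit of what Arraymancer’s README documents):

    import arraymancer, sequtils

    let foo = toSeq(1..20).toTensor.reshape(4, 5)

    echo foo[1..2, 3..4]     # rows 1 to 2, columns 3 to 4
    echo foo[_, 2]           # every row, column 2 only
    echo foo[^1..0|-1, _]    # all rows, in reverse order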

 

[Image: Bending Nim]

Syntax for humans

There is a saying that a dev spends 90% of his time reading code.
I read a lot, and fast. Except that when I’m reading books, I don’t have to wade through lines bristling with parentheses, curly braces and semicolons.

Too much visual noise.

Also, curly braces are not sexy anymore. In any case, even with curly braces you still have to make your code readable by indenting logical blocks (if/else, for loops, function definitions …).
Why use curly braces to tell the computer it’s a logical block and indentation for humans, when indentation alone can do both?

Performance

Nim has very high performance by default, and allows you to tune code at a very low level.

Numerical computing is, contrary to expectations, more dependent on memory speed than on your CPU’s compute muscle. Today’s CPUs are so fast that it’s a challenge to feed them enough data each computation cycle.

So I needed a language that lets you tinker with memory layout; exit traditional high-level languages like Python, Julia, Java …

I needed to manipulate pointers and interact with C and C++ for Nvidia Cuda.

I needed a good data parallelism library, because Python’s Global Interpreter Lock (GIL) frustrated me to no end. Basically, due to the GIL, to get true parallelism in Python you have to spawn separate processes, serialize (pickle) your data and copy it between those processes, paying that overhead on every exchange.

What’s the result?

On exact integer matrix multiplication, which seems to be the abandoned child of tensor libraries (everyone is optimizing floating point), I am about 10x faster than Julia and 22x faster than Numpy.

That means a computation that used to take 1 day now only takes about an hour.

 

Language                                             Speed   Memory
Nim 0.17.3 (devel) + OpenMP[39]                      0.36s   55.5 MB
Julia v0.6.0[40]                                     3.11s   207.6 MB
Python 3.6.2 + Numpy 1.12 compiled from source[41]   8.03s   58.9 MB

How? Low-level memory management in Nim

With careful tuning following this high performance computing course[42] (in C, C++ and Assembler), and by going down to a very low level.

How? Data parallelism in Nim

All map, fold and reduction functions in Arraymancer are parallelized with OpenMP (thank you @edubart[43]). Nim’s OpenMP interface is easy; the Rosetta Code example boils down to this:
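    # The Rosetta Code "parallel loop" task in Nim, lightly adapted.
    # Compile with OpenMP enabled, for example:
    #   nim c --passC:-fopenmp --passL:-fopenmp hello.nim
    for i in 0 || 2:
      # `||` is the standard library's OpenMP "parallel for" iterator
      echo "Hello from iteration ", i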

How? Controlling overhead

Overhead comes in various forms, most notably memory allocation and function calls.

There are two types of allocations: on the stack, which is nearly free (the CPU just bumps a pointer, and the memory is reclaimed automatically when the function returns), and on the heap, where the allocator must search for free space and the memory must later be freed or garbage-collected.

In many high-level languages, everything is on the heap.

Because it was easy, I used to allocate the tensor metadata (for a 2D 10×100 tensor, for example) on the heap.
In this benchmark[44] I was doing 10000 * 100 perceptron computations in a loop, and the program was spending more time creating and destroying memory than doing actual computation.
I moved everything to stack-allocated objects and cut the time from 45+ seconds to 13 seconds.

Also, while functional programming is nice, it is also costly.

Doing toSeq(1..1_000_000).map(square) to square a sequence of 1,000,000 integers forces you to make 1,000,000 function calls.

When doing a function call, the CPU saves its working area (the registers) to the stack, calls the function, then restores its working area from the stack afterwards: a million times over. Remember what I said about memory being THE bottleneck?

Nim allows you to inline almost anything, especially trivial computations like that.

You can rewrite map as a template that inlines its operation. A sketch (inlineMap and x are illustrative names):
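    template inlineMap(s, op: untyped): untyped =
      # `op` is an expression over an injected variable `x`.
      # The body is pasted at the call site: no per-element function call.
      block:
        let sv = s                          # evaluate the input once
        var res = newSeq[typeof(sv[0])](sv.len)
        for i in 0 ..< sv.len:
          let x {.inject.} = sv[i]          # expose `x` to the caller
          res[i] = op
        res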

{.inject.} is a special pragma that exposes x to the calling scope; the template can then be used like this:
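    import sequtils

    let squares = inlineMap(toSeq(1..1_000_000), x * x)
    echo squares[999]   # 1_000_000 (the square of 1000)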

And that code is transparently transformed into something like:
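    # Conceptual expansion (the real one differs in details):
    let sv = toSeq(1..1_000_000)
    var res = newSeq[int](sv.len)
    for i in 0 ..< sv.len:
      let x = sv[i]
      res[i] = x * x      # computed inline: zero function calls
    let squares = res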

Portability

Compiles to C, C++, Objective-C and Javascript

I almost forgot: Nim compiles to optimized C, C++, Objective-C and Javascript.

Yes, Javascript (and soon WebAssembly) is one of its targets. Yes, Javascript! If writing a NES console emulator[45] (with Super Mario Bros) in Nim, playable in the browser, is possible, anything is possible.

So with Nim you can target anything from PCs to embedded devices, from phones to browsers, and obviously graphics cards through CUDA, OpenCL or OpenGL.

Calling C, C++, Obj-C, javascript libraries is easy

Calling C from Nim is really easy. At one point I wanted the log1p function, which computes ln(1+x) without failing badly from rounding when x is near 0 (very useful in machine learning, because loss functions often live in log space and you want to minimize the error). Importing a C function works like this:
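    # Bind C's log1p from <math.h>; no wrapper code needed.
    proc log1p(x: float64): float64 {.importc: "log1p", header: "<math.h>".}

    echo log1p(1e-10)   # ≈ 1e-10, where ln(1.0 + 1e-10) would lose precision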

Nim provides the same import facilities for C++, Obj-C and Javascript (including C++ templates).

Arraymancer

Wow, already so many lines; let’s wrap up.

I started Arraymancer 6 months ago; it now has most of the building blocks ready to provide great ergonomics with blazing fast performance, on CPU and Cuda.

Actually, on CPU, Arraymancer 0.2.0 was already faster[46] than Facebook’s Torch, which is written completely in C. The Arraymancer master branch is probably about 2x faster than Torch now, thanks to parallel reductions and huge optimizations in memory allocation.

Next steps: building all the tools for Convolutional Neural Networks.

Next next steps: leveraging Nim’s convenient syntax for model research, its robustness and scalability for production, and its JS backend for interoperability (REST API, yeah!) in one neat tool.

The end

I hope I gave you a good overview of Nim’s capabilities for High-Performance Computing. Next time you reach for C or Fortran for your HPC needs, pause and seriously consider Nim[2].

Want to discover, discuss, try or contribute? Arraymancer’s Github is here[47].


Also published on Medium[48].

Endnotes:
  1. Arraymancer: https://github.com/mratsim/Arraymancer
  2. Nim: https://nim-lang.org/
  3. GPU virtualized in LXC containers: https://andre-ratsimbazafy.com/cuda-gpu-passthrough-to-a-lxc-container/
  4. programming and number theory, in Haskell: https://github.com/mratsim/haskell-numbertheory
  5. a go-playing bot using Monte-Carlo: https://github.com/mratsim/rustygo
  6. Data Science Bowl #3: https://www.kaggle.com/c/data-science-bowl-2017
  7. Ashish Shah: https://medium.com/towards-data-science/my-experience-participating-in-kaggle-data-science-bowl-2017-lung-cancer-detection-4705032052ec
  8. Guido Zuidhof: https://medium.com/towards-data-science/my-experience-participating-in-kaggle-data-science-bowl-2017-lung-cancer-detection-4705032052ec
  9. Arnav Jain: https://www.kaggle.com/arnavkj95/candidate-generation-and-luna16-preprocessing
  10. CS231n courses: https://www.youtube.com/playlist?list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC
  11. Hacker’s guide to neural network: https://karpathy.github.io/neuralnets/
  12. frustrations with Rust: https://andre-ratsimbazafy.com/why-rust-fails-hard-at-scientific-computing/
  13. here: https://github.com/mratsim/nim-rmad
  14. Toward a (smoking !) high performance numerical computing library in Nim: #toward-a-smoking--high-performance-numerical-computing-library-in-nim
  15. Forewords: #forewords
  16. Table of Contents: #table-of-contents
  17. Why Nim?: #why-nim
  18. Ergonomics: #ergonomics
  19. Cognitive overload: #cognitive-overload
  20. Generics: #generics
  21. Flexible static-typing: #flexible-static-typing
  22. Operator overloading and infix operators: #operator-overloading-and-infix-operators
  23. Functional programming facilities: #functional-programming-facilities
  24. Metaprogramming: the boilerplate squashing tool: #metaprogramming-the-boilerplate-squashing-tool
  25. Syntax for humans: #syntax-for-humans
  26. Performance: #performance
  27. What’s the result?: #whats-the-result
  28. How? Low-level memory management in Nim: #how-low-level-memory-management-in-nim
  29. How? Data parallelism in Nim: #how-data-parallelism-in-nim
  30. How? Controlling overhead: #how-controlling-overhead
  31. Portability: #portability
  32. Compiles to C, C++, Objective-C and Javascript: #compiles-to-c-c-objective-c-and-javascript
  33. Calling C, C++, Obj-C, javascript libraries is easy: #calling-c-c-obj-c-javascript-libraries-is-easy
  34. Arraymancer: #arraymancer
  35. The end: #the-end
  36. gonum library: https://github.com/gonum/gonum/blob/aff0e10c44138b1247d90efb9117f68ba7c76f0c/internal/math32/math.go#L29
  37. airplanes and space shuttle crashes: http://mentalfloss.com/article/25845/quick-6-six-unit-conversion-disasters
  38. Project Euler: https://projecteuler.net/
  39. Nim 0.17.3 (devel) + OpenMP: https://github.com/mratsim/Arraymancer/blob/master/benchmarks/integer_matmul.nim
  40. Julia v0.6.0: https://github.com/mratsim/Arraymancer/blob/master/benchmarks/integer_matmul.jl
  41. Python 3.6.2 + Numpy 1.12 compiled from source: https://github.com/mratsim/Arraymancer/blob/master/benchmarks/integer_matmul.py
  42. high performance computing course: http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/
  43. @edubart: https://github.com/edubart
  44. benchmark: https://github.com/mratsim/Arraymancer/blob/master/benchmarks/ex01_xor.nim
  45. NES console emulator: https://hookrace.net/nimes/
  46. already faster: https://github.com/edubart/arraymancer-demos
  47. here: https://github.com/mratsim/Arraymancer
  48. Medium: https://medium.com/@MARatsimbazafy/high-performance-tensor-library-in-nim-97a0c44de2f4
