Breaking News

High performance tensor library in Nim

High performance tensor library in Nim

Toward a (smoking !) high performance tensor library in Nim


In April I started Arraymancer, yet another tensor library™, in yet another language™, Nim.

Why you ask?

I wanted to have fun. When I’m bored, I like to learn new things, research them “in-depth” and do things with that.

It’s a rabbit hole, really:

  • 14 years ago, it was video processing and encoding. I remember forcing my PC to a crawl while using Avisynth and this awesome FFT3DGPU to denoise videos taken from handheld camera.
  • 13 years ago, I learned go, the board game. After a year I was almost dan level (black belt), spent weekends and roam hundreds of km per year for tournaments including French national championship and European Go Congress.
  • 7 years ago, it was Linux, with Gentoo being my first distribution. Now I’m sitting on a server with my machine/deep learning stack and GPU virtualized in LXC containers, NAS storage and security stack virtualized via KVM.
  • 6 years ago, it was programming and number theory, in Haskell, no for loops anyone?
  • 2 years ago, still programming and algorithms: a go-playing bot using Monte-Carlo, in Rust. The borrow-checker and me are not friends.
  • 1 year ago, data science, I even bought a GPU just to do deep learning and warm up my home in winter.

In January I geared up for Data Science Bowl #3, a 1 million dollar competition to predict lung cancer sponsored by the US National Cancer Institute, downloaded 160GB of 3D lung scans and …

Damn Python/Numpy/Scipy was slow, I spent more time preprocessing the images with watershed segmentation (a full night) than training my neural network (3 hours).

Lung segmentation

Image courtesy Ashish Shah, Guido Zuidhof and Arnav Jain

That means that I couldn’t test as fast as I wished. I just stopped there and wanted to learn more about neural networks and computer vision behind the scene. I watched Standford’s CS231n courses and did the Hacker’s guide to neural network tutorial by Andrej Karpathy.

It was in Javascript, I wanted to learn another language I stumbled upon after my frustrations with Rust: Nim. So here I go.

Until I reached this:

No way I’m coding that, so I wrote an autograd to automatically differentiate functions, available here.

Then I wanted to generalize this autograd to vectors, matrices and tensors, 4D tensors especially so I could pass N images, C color channels (RGB), H height and W width.

Nim ecosystem being young (Nim is 8 years old, Python 27 years old), nothing suited my needs so I started my own with the goals of:

  • Having fun
  • Being ergonomic
  • Being fast

Table of Contents

Why Nim?


Cognitive overload

Building something fast but too hard to use is a tragedy. Scientific computing is an exercise in fast experimentations and iterations and the programming language should be in the shadows so that the scientist focus on his algorithms.

That’s an obvious fail for Rust, the borrow checker brings too much cognitive overload.


Furthermore maths and science is all about generalization and extracting overarching concepts out of things.

I don’t want to write this kind of things in 2017:

This is go by the way taken from gonum library. Just give me generics like in C++, Rust, Haskell …

Here I just tell the compiler that only signed integer (int32, int64) and floating points (float32, float64) input are valid.

Flexible static-typing

Static-typing is often seen as an obstacle in science for fast iterations. This does not fly. So that Python numerical libraries are robust, most functions in Python are buried deep in input checks to make sure the input is correct.


Furthermore, if physicists were to know about the following features we might have avoided airplanes and space shuttle crashes:

Note: Nim provides tools to facilitate implicit conversion

You probably noticed that you had to tell which operators were valid like + and *. This is actually a feature because multiplying Meters should give Meters2 not Meters

How awesome is that? Note: this is actually also useful in Finance to avoid adding two different currencies or British pounds with pennies.

Operator overloading and infix operators

In the previous section you saw * and +, Nim allows you to reuse the same operators to actually mean different things depending on the input. * between two float64 give a float64, but between Meters it gives Meters2, between matrices and a vector it gives a vector, etc.

Also Nim allows you to create your own operator by combining any number of these symbols:


So I could create proc * to mean matrix multiplication and .* to mean elementwise multiplication like in Julia and Matlab. I could even use <*> to mean Haskell applicative map but let’s not go there.

Functional programming facilities

Functions are first-class citizen in Nim and can be passed to other functions. I let my solution to Project Euler problem #5 speak for me.

Metaprogramming: the boilerplate squashing tool

I’m fond of DRY (Don’t Repeat Yourself). Object-Oriented Programming, getter, setters, etc are not for me, Java especially is awfully guilty of inflating the number of lines of code to the extreme.

I like my programs lean and fast.

So let’s say I want to implement “universal functions” in my library, in the Numpy jargon that means a function that applies element-wise to all elements of the multidimensional array, like sin, cos, exp, sqrt, ln/log


  • A way to apply a function on all elements of the array (say map)
  • Exporting that function for all to use.

How boring would it be to use that:

Instead you can use a template that will do the boilerplate for you:

You might say, that I saved 1 line per new function, useless.
Actually this is EXTREMELY useful to wrap Cuda Kernels, like so:

Suddenly it’s 10 lines saved per operands, convinced?

Ok, last try, thanks to Nim metaprogramming I could implement this very nice slicing syntax by just bending the language to my will, like Neo in the Matrix:


Bending Nim

Syntax for humans

There is a say that a dev spent 90% of his time reading code.
I read a lot, and fast. Except that when I’m reading books I don’t have to read things like this:

Too much visual noise.

Also curly braces are not sexy anymore. In any case even if there are curly braces it does not excuse you from making your code readable with indentation for logical blocks (if/else, for loop, function definition …).
Why use curly braces to tell the computer that it’s a logical block and use indentation for humans while you can do both with just indentation?


Nim has very high performance by default, and allow you to tune code at a very low level.

Numerical computing is, contrary to the expectations, more dependant on memory speed than on your CPU compute muscles. Today CPUs are so fast that it’s a challenge to feed them enough data in a computation cycle.

So I needed a language that allows you to tinker with memory layout, hence exit traditional high-level languages like Python, Julia, Java …

I needed to manipulate pointers and interact with C and C++ for Nvidia Cuda.

I needed a good data parallelism library because Python Global Interpreter Lock (GIL) frustrated me to no end. Basically due to the GIL to have multiprocessing in Python you have to:

  • Either copy the data on each of your cores, good luck with a 8Go batch of images.
  • Code in C or Cython and tell Python not to lock.

What’s the result?

On exact integer matrix multiplication which seemed to be the abandoned children of tensors libraries as everyone is optimising floating points, I am 10x faster than Julia and 22x faster than Numpy.

That means that a computation that would take 1 day now only takes about a hour.


Nim 0.17.3 (devel) + OpenMP0.36s55.5 MB
Julia v0.6.03.11s207.6 MB
Python 3.6.2 + Numpy 1.12 compiled from source8.03s58.9 MB

How? Low-level memory management in Nim

With careful tuning from this high performance computing course (in C, C++ and Assembler), and then going down at a very low level:

How? Data parallelism in Nim

All map, fold and reductions functions in Arraymancer are parallelized with OpenMP (thank you @edubart). Nim OpenMP interface is easy, from Rosetta code:

How? Controlling overhead

Overhead comes in various forms, most notably memory allocation and function calls.

There are two types of allocations:

  • the stack, very fast, but small (1MB to 8MB).
  • the heap, slow, as big as your RAM.

In many high-level language everything is on the heap.

Because it was easy, I used to allocate the tensor metadata (2D 10×100 tensor for example) on the heap.
In this benchmark I was doing 10000 * 100 perceptrons computations in a loop and the program was spending more time creating and destroying memory than doing actual computation.
I moved all to stack-allocated objects and cut down the time from 45+ seconds to 13 seconds.

Also, while functional programming is nice, it is also costly.

Doing toSeq(1..1_000_000).map(square) to square a 1000000 integer sequence will forces you to do
1_000_000 function calls.

When doing a function call, the CPU saves its working area (registers) to the stack, calls the function and then must retrieve again the data from the working area, 1 million times. Remember what I said with memory being THE bottleneck?

Nim allows you to almost inline anything especially trivial computation like that.

You can rewrite map that inline its input with a template:

{.inject.} is a special notation that exposes x to the outside, it can be used like this

And that code will be transformed transparently into:


Compiles to C, C++, Objective-C and Javascript

I almost forgot, Nim compiles to optimized C, C++, Objective-C and Javascript.

Yes Javascript (and soon WebAssembly) is one of its target. Yes, javascript! If writing a NES console emulator (with Super Mario Bros) in Nim, playable in the browser is possible, anything is possible.

So with Nim you can target anything from PC to embedded devices , phones to browsers and obviously graphics cards through CUDA, OpenCL or OpenGL.

Calling C, C++, Obj-C, javascript libraries is easy

Calling C from Nim is really easy, at one point I wanted the log1p function which computes ln(1+x) and does not fail badly due to rounding if x is near 0 (and is very useful in machine learning because loss functions are often in log space and you want to minimize the error). Import of C functions works like this:

Nim provides the same import facilities for C++, Obj-C and Javascript (including C++ templates)


Wow already so many lines, let’s wrap up.

I started Arraymancer 6 months ago, it now has most of the building blocks ready to provide great ergonomics with blazing fast performance, on CPU and Cuda.

Actually for CPU Arraymancer 0.2.0 was already faster than Facebook Torch that is written completely in C. Arraymancer master branch is probably about 2x faster than Torch now due to parallel reduction and huge optimizations in memory allocation.

Next steps: building all the tools for Convolutional Neural Networks.

Next next steps: Leveraging Nim convenient syntax for model research, robustness and scalability for production and JS backend for interoperability (REST API yeah !) in one neat tool.

The end

I hope I gave you a great overview of Nim capabilities for High-Performance Computing and next time you consider C or Fortran for your HPC needs, look at yourself and seriously consider Nim.

Want to use discover, discuss, try or contribute? Arraymancer’s Github is here.

Also published on Medium.

Tags assigned to this article:
deep learningmachine learningnimprogramming

Related Articles

Predicting apartment interest from listing with structured data, free text, geolocalization, time data and images

3 months ago, Data Science competition website Kaggle published a challenge from Two Sigma, an investment fund, and Renthop, a

Data Science Bowl 2017 – Space-Time tricks

Here we are again for the third post on my journey to Deep Learning. Like all super-heroes, the data scientist

Journey to Deep Learning #2: Don’t fight the wrong fights

2 months ago, I took my courage and threw it at Machine Learning. Machine Learning, at least the supervised learning,

  • Benjiro

    Can you also take a look at Crystal ( )? Because i think it will fit just as well as Nim and may be even more speedy in your case.

    • Mamy Ratsimbazafy

      I would probably use Crystal if I didn’t hear of Nim first actually. Both strike an excellent balance of ergonomics, flexibility and speed. The other language with those characteristics would be D.

      Regarding speed, I don’t think Crystal would be faster if you cannot solve those questions:
      – How to insert OpenMP from Crystal ?
      – Is it possible to inline function argument (closures) passed to an higher order function or is there a metaprogramming facility like Nim templates: `let seq_square = toSeq(1..20).map_inline(x * x)
      – Is it possible to pass C pragmas like __restrict__ or __attribute_aligned(64) to LLVM

      Not having these prevents you from max performance while not dropping to Assembly.

      Now the really killer features of Nim compared to Crystal (at first glance, I didn’t actually code Crystal):
      – C++ bindings: This is key for custom Cuda GPU kernels as I need to create C++ Functors
      – Nim compiles very very fast and it is important for rapid testing:,,
      – JS target and calling NodeJS packages. Though Crystal can target JS through WebAssembly/Emscripten/LLVM, I don’t think you can have bindings with Node.js packages right?