ND4J Basics with Scala
It’s no secret that one of the biggest advantages of languages like Python, R or Matlab in the machine learning and data science community is the ease with which they can manipulate big and complex numerical structures. While Matlab and R have most of this behavior built in, Python relies on the SciPy stack to do the trick, and it works wonders!
Unfortunately, a usability gap has long prevented Java, Scala, Clojure and, in general, JVM developers from accessing this power. This is where ND4J, a scientific computing library for the Java Virtual Machine (JVM), excels: it brings the intuitive scientific computing tools of the Python community to the JVM in an open source, distributed and GPU-enabled library designed to run fast in production environments. Among its main features are:
- A versatile n-dimensional array object.
- Multi-platform functionality, including GPUs.
- Linear algebra and signal processing functions.
- Integration with Apache Hadoop and Apache Spark.
Although there’s an implementation for Scala (called ND4S), it’s not as idiomatic as I’d like, so we’ll stick with the Java version but keep writing our code in Scala.
The biggest player in ND4J is a data structure called NDArray (exposed in the API as the INDArray interface). Essentially, it’s an n-dimensional array of numbers. An NDArray is described by the following properties:
- Rank: The number of dimensions of the array. For instance, a column or row vector has rank 1. A matrix has rank 2. A cube has rank 3.
- Shape: The size of the NDArray along each dimension. The shape has as many components as the NDArray has dimensions. For instance, an NDArray of rank 2 can have 3 rows and 3 columns, so its shape is said to be (3, 3). A five-dimensional NDArray (basically, a tensor of rank 5) may have the following shape: (2, 3, 45, 2, 1).
- Length: The total number of elements in the NDArray. In other words, it’s the product of the values that make up the shape.
- Stride: A parameter that reflects how ND4J operates internally; most of the time you shouldn’t need to worry about it. The stride of an NDArray is the separation (in the underlying data buffer) between consecutive elements along each dimension. So, an NDArray of rank R has R stride values, one per dimension.
- Data type: Another internal parameter of ND4J. It’s a global configuration parameter that tells ND4J whether to store numbers as floats or doubles.
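As a rough illustration of these properties, here is a small worksheet-style sketch that builds a 3x3 array and queries each one. It assumes ND4J is on the classpath and that the factory methods accept plain integer sizes; the value names are only illustrative.

```scala
import org.nd4j.linalg.factory.Nd4j

// A 3x3 matrix of zeros to inspect (illustrative value name)
val m = Nd4j.zeros(3, 3)

val rank   = m.rank()    // number of dimensions: 2
val shape  = m.shape()   // size along each dimension: (3, 3)
val length = m.length()  // total number of elements: 3 * 3 = 9
val stride = m.stride()  // buffer step per dimension; typically (3, 1) for row-major order

println(s"rank=$rank shape=${shape.mkString(",")} length=$length stride=${stride.mkString(",")}")
```

Note how the length is just the product of the shape components, as described above.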
Operation types in ND4J
There are three types of operations used in ND4J:
- Scalars: They take just two arguments: the input and the scalar to be applied to that input.
- Transforms: They take a single argument and perform an operation on it. Example: the sigmoid function.
- Accumulations: Also known as folds or reductions, they combine the elements of arrays and vectors (for instance, by adding them) and can reduce the dimensions of those arrays in the result, such as when adding their elements row-wise. Accumulations can also be either pair-wise or scalar.
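To make the three categories a bit more tangible, here is a hedged sketch with one example of each. It assumes ND4J is on the classpath and uses the `Transforms` helper class for the sigmoid; the value names are made up for illustration.

```scala
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.ops.transforms.Transforms

val zeros = Nd4j.zeros(2, 2)

// Scalar op: the input plus one scalar applied to every element
val plusOne = zeros.add(1)

// Transform: a single-argument, element-wise function (sigmoid(0) = 0.5)
val squashed = Transforms.sigmoid(zeros)

// Accumulation: reduce along dimension 0, or all the way down to one number
val columnSums = plusOne.sum(0)
val total = plusOne.sumNumber().doubleValue()
```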
There are many ways of creating NDArrays. The first one is using the Nd4j class, which has a variety of static helper methods that make the task of creating NDArrays a whole lot easier. Let’s explore some of them:
Let’s create a single column vector of 10 elements:
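A minimal sketch of this, assuming ND4J is on the classpath and that `Nd4j.zeros` accepts plain integer sizes (newer releases also offer `long` overloads, so the literals may need an `L` suffix there). The value name is only illustrative.

```scala
import org.nd4j.linalg.factory.Nd4j

// 10 rows, 1 column: a column vector filled with zeros
val columnVector = Nd4j.zeros(10, 1)

println(columnVector)
```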
What about a matrix?
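Something along these lines should do, under the same assumptions (ND4J on the classpath, illustrative value name):

```scala
import org.nd4j.linalg.factory.Nd4j

// 3 rows, 4 columns, all zeros
val matrix = Nd4j.zeros(3, 4)

println(matrix)
```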
In fact, we can create a tensor of any shape:
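For example, a sketch that reuses the rank-5 shape mentioned earlier, (2, 3, 45, 2, 1). It assumes the varargs form of `Nd4j.zeros` resolves cleanly for integer literals in the ND4J release at hand.

```scala
import org.nd4j.linalg.factory.Nd4j

// One size per dimension: a rank-5 tensor of zeros
val tensor = Nd4j.zeros(2, 3, 45, 2, 1)

// Length is the product of the shape: 2 * 3 * 45 * 2 * 1 = 540
println(s"rank=${tensor.rank()} length=${tensor.length()}")
```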
If we want to initialize a tensor with a value other than zero, we can just add that value after the tensor has been created:
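A small sketch of the idea, assuming ND4J is on the classpath; the value 5 and the name `fives` are arbitrary choices for illustration.

```scala
import org.nd4j.linalg.factory.Nd4j

// Start from zeros, then add 5 to every element with addi,
// which modifies the array itself and returns it
val fives = Nd4j.zeros(3, 3).addi(5)

println(fives)
```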
Here the “i” stands for “in-place”. Using add (without “i”) would have created a copy.
We can initialize with random numbers:
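A hedged sketch using `Nd4j.rand`, which draws values uniformly between 0 and 1 (ND4J assumed on the classpath, value name illustrative):

```scala
import org.nd4j.linalg.factory.Nd4j

// 3x3 matrix with uniformly distributed random values
val uniform = Nd4j.rand(3, 3)

println(uniform)
```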
If we want to generate a random tensor from a Gaussian distribution, we use the randn method instead:
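Under the same assumptions as before, a sketch with `Nd4j.randn`, which samples from a standard normal distribution:

```scala
import org.nd4j.linalg.factory.Nd4j

// 3x3 matrix with values sampled from a Gaussian (mean 0, standard deviation 1)
val gaussian = Nd4j.randn(3, 3)

println(gaussian)
```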
We can also create NDArrays with the create() method. The first way is to just pass an array of numbers and an array describing the shape, and ND4J will make sure these numbers are properly organized in a tensor of the shape provided:
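As a sketch, assuming the `create` overload that takes a flat array of doubles plus an integer shape array is available in the ND4J release in use (the data and the value name are made up):

```scala
import org.nd4j.linalg.factory.Nd4j

// Six numbers laid out as a 2x3 matrix
val data  = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
val shape = Array(2, 3)

val fromFlat = Nd4j.create(data, shape)

println(fromFlat)
```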
If we have an array of any dimensions that’s native to Scala or Java (or Clojure), we can transform it into an NDArray also with the create() method:
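For instance, a two-dimensional Scala array of doubles could be converted roughly like this (ND4J assumed on the classpath, names illustrative):

```scala
import org.nd4j.linalg.factory.Nd4j

// A plain Scala 2D array (a double[][] on the JVM)
val nested = Array(
  Array(1.0, 2.0),
  Array(3.0, 4.0)
)

// ND4J infers the shape (2, 2) from the nested array
val fromNested = Nd4j.create(nested)

println(fromNested)
```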
Basic manipulation of NDArrays
There are several ways of manipulating values inside an NDArray. Let’s explore some of them:
Let’s use this matrix for our examples:
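Any small matrix will do; the sketch below assumes a 3x3 matrix holding the numbers 1 to 9, built from a nested Scala array (the contents are an arbitrary choice for illustration).

```scala
import org.nd4j.linalg.factory.Nd4j

// 3x3 example matrix with the values 1..9, row by row
val matrix = Nd4j.create(Array(
  Array(1.0, 2.0, 3.0),
  Array(4.0, 5.0, 6.0),
  Array(7.0, 8.0, 9.0)
))

println(matrix)
```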
If we want to get a column or a row, we just need to specify its index:
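A sketch with `getRow` and `getColumn`, assuming the illustrative 1..9 matrix (redefined here so the snippet stands on its own):

```scala
import org.nd4j.linalg.factory.Nd4j

val matrix = Nd4j.create(Array(
  Array(1.0, 2.0, 3.0),
  Array(4.0, 5.0, 6.0),
  Array(7.0, 8.0, 9.0)
))

val firstRow     = matrix.getRow(0)     // values 1, 2, 3
val secondColumn = matrix.getColumn(1)  // values 2, 5, 8

println(firstRow)
println(secondColumn)
```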
We can fetch multiple rows or columns. We just pass multiple indices:
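Roughly like this, using `getRows` and `getColumns` on the same illustrative 1..9 matrix:

```scala
import org.nd4j.linalg.factory.Nd4j

val matrix = Nd4j.create(Array(
  Array(1.0, 2.0, 3.0),
  Array(4.0, 5.0, 6.0),
  Array(7.0, 8.0, 9.0)
))

val outerRows    = matrix.getRows(0, 2)     // first and last rows
val firstColumns = matrix.getColumns(0, 1)  // first two columns

println(outerRows)
println(firstColumns)
```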
If we want to get a single value, we just specify an index for each dimension. In this case, we are working with a matrix, so we need to supply two indices (row and column):
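A sketch with `getDouble`, again on the illustrative 1..9 matrix. It assumes integer indices resolve unambiguously; on ND4J releases that also offer `long` overloads, the indices may need to be written as `1L, 2L`.

```scala
import org.nd4j.linalg.factory.Nd4j

val matrix = Nd4j.create(Array(
  Array(1.0, 2.0, 3.0),
  Array(4.0, 5.0, 6.0),
  Array(7.0, 8.0, 9.0)
))

// Row index 1, column index 2 (both zero-based): the value 6.0
val value = matrix.getDouble(1, 2)

println(value)
```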
Modifying a single value in a row or a column WILL AFFECT the tensor because each row or column obtained is a VIEW of the actual NDArray.
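To make the view behaviour concrete, here is a small sketch (illustrative names and values) that writes into a row obtained with `getRow` and then checks the original matrix:

```scala
import org.nd4j.linalg.factory.Nd4j

val matrix = Nd4j.create(Array(
  Array(1.0, 2.0, 3.0),
  Array(4.0, 5.0, 6.0),
  Array(7.0, 8.0, 9.0)
))

// The row is a view: it shares its data with the matrix
val firstRow = matrix.getRow(0)
firstRow.putScalar(0, 100.0)

// The change shows up in the original matrix as well
println(matrix)
```

If an independent copy is what we need, calling `dup()` on the row before modifying it is one way to detach it from the matrix.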
Of course, we can also perform basic arithmetic operations:
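A short sketch of a few of them, assuming ND4J is on the classpath; the two 2x2 matrices are made up for the example. Note that `mul` is element-wise, while `mmul` is the matrix product.

```scala
import org.nd4j.linalg.factory.Nd4j

val a = Nd4j.create(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
val b = Nd4j.create(Array(Array(10.0, 20.0), Array(30.0, 40.0)))

val sum        = a.add(b)   // element-wise addition
val difference = b.sub(a)   // element-wise subtraction
val product    = a.mul(b)   // element-wise multiplication
val quotient   = b.div(a)   // element-wise division
val matProduct = a.mmul(b)  // matrix multiplication
val shifted    = a.add(10)  // scalar added to every element

println(matProduct)
```

All of these return new arrays; the variants ending in “i” (addi, subi, muli, divi) work in place instead.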
As you can see, hitting the ground running with ND4J is really easy. Its clear API and intuitive design make manipulating large numerical structures a less painful operation, much closer to the alternatives mentioned at the beginning of this article. You can get the code used in this post here. There’s a Scala worksheet with all these examples, so you can play and keep exploring. Have fun!
What do you think about ND4J? Have you worked with it before? Let me know in the comments. I’d love to know!
See you soon!
About the Author
Jesús Martínez is the creator of DataSmarts, a site for people who have a passion for machine learning and also for the JVM, and don't want to drop either of them in order to have fun! When he is not blogging, toying around with some new algorithm or working on some (very) cool side project, he enjoys listening to The Beatles and binge-watching the trendiest show on Netflix.