Learn NumPy in One Article

If you're using Python to do Machine Learning, Deep Learning, engineering, or anything else that has to do with math, then you need to start by learning NumPy. It is a MUST!

Why? Because NumPy is THE package that allows you to create matrices and do mathematics in an ultra-efficient way (NumPy's core is written in C)! If you are an engineer, a scientist or a mathematician, you probably know it: matrices are the basis of everything.

This article is a complete tutorial on NumPy. At the end, you will be able to create matrices, do math with them, and avoid the most common bugs when using NumPy!

What's on the program:

ndarray: The NumPy array in N dimensions

ndarray: ndim, shape, size
generators: np.ones(), np.zeros(), np.random.randn(), ...
methods: ravel(), reshape(), concatenate(), ...

Indexing, Slicing in NumPy

Boolean indexing
Image processing example

Statistics and Linear Algebra with NumPy

argmax(), argsort(), sum()
Statistics + np.unique(), np.corrcoef()
Linear algebra

Broadcasting in NumPy

Warning! Machine learning program
Review

1. NumPy ndarray: The N-Dimensional Array

At the base of NumPy there is a very powerful object: The N-Dimensional array (ndarray).

If I say that it is a powerful object, it is because it lets you perform a lot of advanced mathematical operations, it can hold very large amounts of data, and it is very fast in execution.

In engineering, machine learning and data science, we most often work with 2-dimensional arrays (dataset, image, matrix), and sometimes with 3-dimensional arrays (for example a color image, which contains the Red, Green and Blue layers).

NumPy Array Generators

import numpy as np

A = np.zeros((2, 3)) # a 2x3 array filled with 0
B = np.ones((2, 3)) # a 2x3 array filled with 1
C = np.random.randn(2, 3) # a 2x3 random array (standard normal distribution)
D = np.random.rand(2, 3) # a 2x3 random array (uniform distribution on [0, 1))
 
E = np.random.randint(0, 10, [2, 3]) # a 2x3 array of random integers from 0 to 9 (10 is excluded)

It is also possible to choose the type of data we want to use for our array using the dtype parameter. This can be very important for writing fast, memory-efficient code.

A = np.ones((2, 3), dtype=np.float16)
B = np.eye(4, dtype=bool) # creates an identity matrix whose elements are of bool type (np.bool was removed in recent NumPy versions, use bool)
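To see why dtype matters for efficiency, here is a small sketch comparing the memory used by the same array in two dtypes. The variable names and the 1000x1000 size are just illustrative choices, not something from a real project.

```python
import numpy as np

# Same shape, different dtypes: only the size of each element changes.
a64 = np.ones((1000, 1000))                    # default dtype is float64
a16 = np.ones((1000, 1000), dtype=np.float16)  # half precision

print(a64.dtype, a64.itemsize, a64.nbytes)  # float64 8 8000000
print(a16.dtype, a16.itemsize, a16.nbytes)  # float16 2 2000000
```

The float16 array is four times smaller, at the price of lower precision, so the right choice depends on your data.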

The N-Dimensional array class (ndarray) offers several attributes and methods. Here are the most useful ones, which you should absolutely know!

Important Attributes of ndarray

A = np.zeros((2, 3)) # creates an array of shape (2, 3)
 
print(A.size) # the number of elements in the array A
print(A.shape) # the dimensions of array A (in Tuple form)
 
print(type(A.shape)) #here is the proof that the shape is a tuple
 
print(A.shape[0]) # the number of elements in 1st dimension of A
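The ndim attribute gives the number of dimensions. As a minimal sketch, assuming a hypothetical 4x5 color image stored with 3 channels, the three attributes relate like this:

```python
import numpy as np

# Hypothetical 4x5 color image with 3 channels (Red, Green, Blue).
image = np.zeros((4, 5, 3))

print(image.ndim)   # 3  -> number of dimensions
print(image.shape)  # (4, 5, 3)
print(image.size)   # 60 -> 4 * 5 * 3 elements
```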

Important Methods of ndarray

A = np.zeros((2, 3)) # creates an array of shape (2, 3)
 
A = A.reshape((3, 2)) # reshapes the array A (3 rows, 2 columns)
A.ravel() # flattens the array A (one dimension only)
A.squeeze() # eliminates the "1" dimensions of A.
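One detail that is easy to miss: these methods return a new array object and leave the original shape untouched, so you need to keep the result. The sketch below also shows np.concatenate(), which joins arrays along an axis. The variable names are only for illustration.

```python
import numpy as np

A = np.zeros((2, 3))
B = np.ones((2, 3))

flat = A.ravel()            # shape (6,)
col = flat.reshape((6, 1))  # shape (6, 1)
back = col.squeeze()        # shape (6,): the dimension of size 1 is removed

rows = np.concatenate((A, B), axis=0)  # stacked vertically   -> shape (4, 3)
cols = np.concatenate((A, B), axis=1)  # stacked side by side -> shape (2, 6)

print(A.shape, flat.shape, col.shape, back.shape)
print(rows.shape, cols.shape)
```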

2. Indexing and Slicing in a NumPy Array

When working on a NumPy array (usually 2D), it is important to be able to navigate easily in order to manipulate the data. To do this, we give one index (or one range of indexes) per axis: the first for the rows, the second for the columns.

Indexing

As with lists, we access a particular element of the array by giving the index of this element.

A = np.array([[1, 2, 3], [4, 5, 6]])
 
print(A[0, 1]) # row 0, column 1

Slicing

In the case of slicing, we instead access several elements along each axis of the array, which gives a subset of it. We must therefore indicate a start index and an end index for each dimension of our array (the end index is excluded).

A = np.array([[1, 2, 3], [4, 5, 6]])
 
print(A[0:2, 0:2]) # row 0 and 1, column 0 and 1

On the web, it is common to see "implicit" slicing operations, in which only 1 index is given. This extracts the whole row corresponding to this index (see the example below). For better clarity, I recommend always using the explicit syntax in your code.

A = np.array([[1, 2, 3], [4, 5, 6]])
 
print(A[1,:]) # prints the whole row 1, explicitly
 
print(A[1]) # prints the whole row 1, but this syntax is not ideal
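A few more slicing patterns come up all the time, so here is a short sketch of them on the same small array: taking whole columns, using a negative index, and assigning to a slice.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

print(A[:, 0:2])  # all rows, columns 0 and 1 -> [[1 2] [4 5]]
print(A[:, -1])   # all rows, last column     -> [3 6]

A[0:2, 0:2] = 0   # a slice can also receive values
print(A)          # [[0 0 3] [0 0 6]]
```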

Now, let's take a look at a very common technique used in Data Science: Boolean Indexing.

Boolean Indexing

When we perform a Boolean test on a NumPy array (for example A < 5), NumPy produces an array of bool dtype with the same shape as the array A. This array is called a boolean mask.

A boolean mask can be used as a filter to perform a boolean indexing operation: if an element respects a given condition (bool = True) then it is selected to be part of the result subset.

This technique is very useful in data analysis to filter or convert the values of an array when they are below or above a certain threshold.

A = np.array([[1, 2, 3], [4, 5, 6]])
 
print(A<5) # boolean mask
 
print(A[A < 5]) # a subset filtered by the boolean mask
 
A[A<5] = 4 # replaces the selected values with 4
print(A)
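Boolean indexing is also handy in image processing. The sketch below is only an illustration under stated assumptions: the "image" is a small random array standing in for a real greyscale picture, and the threshold of 128 is an arbitrary value chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)             # seeded so the sketch is repeatable
image = rng.integers(0, 256, size=(6, 6))  # hypothetical 6x6 greyscale image, values 0 to 255

threshold = 128                            # arbitrary cut-off chosen for illustration
mask = image > threshold                   # boolean mask with the same shape as image

cleaned = image.copy()                     # work on a copy to keep the original intact
cleaned[mask] = 255                        # bright pixels become white
cleaned[~mask] = 0                         # all the other pixels become black

print(mask.sum(), "pixels above the threshold")
print(cleaned)
```

With a real picture, you would load the pixels into an array first (for instance with Matplotlib), and the mask logic would stay the same.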

3. Mathematics Using NumPy

Basic Mathematics

Doing mathematics with NumPy is not really hard. Indeed, NumPy is full of functions to do statistics and even linear algebra.

The ndarray class itself contains most of the basic mathematical functions: sums, products, averages, standard deviations, etc. You just use the following methods, each of which can be restricted to one axis of the NumPy array:

A = np.array([[1, 2, 3], [4, 5, 6]])
 
print(A.sum()) # sums all the elements of the array
print(A.sum(axis=0)) # sums along axis 0 (down the rows): one sum per column
print(A.sum(axis=1)) # sums along axis 1 (across the columns): one sum per row
print(A.cumsum(axis=0)) # performs the cumulative sum along axis 0
 
print(A.prod()) # performs the product of all the elements
print(A.cumprod()) # performs the cumulative product (on the flattened array)
 
print(A.min()) # finds the minimum of the array
print(A.max()) # finds the maximum of the array
 
print(A.mean()) # calculates the mean (average)
print(A.std()) # calculates the standard deviation
print(A.var()) # calculates the variance
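The axis argument is the part that confuses people most, so here is a small worked sketch on the same array: the axis you pass is the one that disappears from the result.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

col_sums = A.sum(axis=0)    # axis 0 (rows) is collapsed    -> shape (3,)
row_sums = A.sum(axis=1)    # axis 1 (columns) is collapsed -> shape (2,)
col_means = A.mean(axis=0)  # the same rule applies to the other methods

print(col_sums)   # [5 7 9]
print(row_sums)   # [ 6 15]
print(col_means)  # [2.5 3.5 4.5]
```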

Among these methods, there is also the sort() method, which sorts an array, but even more useful is the argsort() method.

NumPy argsort()

argsort() returns the indexes in the sort order of the array, without modifying it. These indexes can then be used to sort any other sequence, according to the order of the original array.

For example, it is possible to sort a whole array A according to one of the columns of A!

A = np.random.randint(0, 10, [5, 5]) # random array
print(A)
 
print(A.argsort()) #returns the indexes to sort each row of the array 
 
print(A[:,0].argsort()) #returns the indexes to sort column 0 of A
 
A = A[A[:,0].argsort(), :] # sorts the rows of A according to the values in column 0
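The same idea works for sorting one sequence according to another. In this sketch the scores and names are made-up data, used only to show how the indexes returned by argsort() can reorder a second array.

```python
import numpy as np

scores = np.array([72, 95, 60, 88])
names = np.array(["Ana", "Bob", "Cid", "Dee"])  # hypothetical labels, one per score

order = scores.argsort()   # indexes that would sort scores in ascending order
print(order)               # [2 0 3 1]
print(names[order])        # ['Cid' 'Ana' 'Dee' 'Bob']
print(names[order[::-1]])  # reversed order: ['Bob' 'Dee' 'Ana' 'Cid']
```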

NumPy Statistics

Beyond the methods present in the ndarray class, NumPy offers many more advanced mathematical functions. It is possible to do statistics, linear algebra, Fourier transforms, and more.

The statistical routines are all documented in the NumPy reference. There are means, variances and standard deviations, but also functions to calculate correlations and histograms. Among these routines, I recommend the corrcoef() function, with which you can calculate the Pearson correlation between the different rows or columns of a NumPy array.

B = np.random.randn(3, 3) # 3x3 random numbers
 
# returns the correlation matrix of B (each row is treated as a variable)
print(np.corrcoef(B))
 
# returns the 2x2 correlation matrix between columns 0 and 1 of B
print(np.corrcoef(B[:,0], B[:, 1]))
 
# selects the correlation coefficient between column 0 and column 1
print(np.corrcoef(B[:,0], B[:, 1])[0,1])

Statistics with NaN

In Data Science, it is common to have missing data in a dataset, which then appear as NaN values in a NumPy array. In these conditions, the usual statistical functions (mean, std, var) simply return NaN. This is why NumPy provides functions such as nanmean(), nanstd() and nanvar(), which carry out these calculations while ignoring the NaN values.
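Here is a minimal sketch of the difference, on a tiny made-up array with one missing value:

```python
import numpy as np

A = np.array([1.0, 2.0, np.nan, 4.0])  # one missing value

print(A.mean())           # nan: the NaN contaminates the result
print(np.nanmean(A))      # mean of 1, 2 and 4 only (7 / 3)
print(np.isnan(A).sum())  # 1: counts the missing values
```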

Linear Algebra using NumPy

NumPy also lets you do linear algebra using the routines available in the numpy.linalg module. The following routines are among the most important ones: they compute the determinant of a matrix, invert a matrix, and compute its eigenvalues and eigenvectors.

A = np.ones((2,3))
B = np.ones((3,3))
 
print('transpose', A.T) # transpose of the matrix A
 
print('A.B product', A.dot(B)) # matrix product A.B
 
 
A = np.random.randint(0, 10, [3, 3])
 
print('det', np.linalg.det(A)) # calculates the determinant of A
print('A inv', np.linalg.inv(A)) # calculates the inverse of A (raises LinAlgError if A is singular)
 
val, vec = np.linalg.eig(A)
print('eigenvalue', val) # eigenvalue
print('eigenvector', vec) # eigenvector
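A good habit is to check these results numerically. The sketch below uses a small matrix that I picked by hand because it is invertible (its determinant is 5). It verifies the inverse and the eigen-decomposition with np.allclose(), which compares floats with a tolerance.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # hand-picked invertible matrix, det = 2*3 - 1*1 = 5

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # True: A times its inverse is the identity

val, vec = np.linalg.eig(A)
# Each column vec[:, i] is the eigenvector paired with val[i], so A @ v = val * v.
print(np.allclose(A @ vec, vec * val))    # True
```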

Well done! With the content of this page, you now master the essentials of NumPy to do Machine Learning, and more generally, scientific computing. Finally, it is very important to get familiar with one of the foundations of NumPy: Broadcasting.

4. Broadcasting Using NumPy

Usually, when you want to perform a mathematical operation between 2 arrays in a language like C++ (for example, add them together), you have to write for loops to access the different elements of these arrays and add them together.

With NumPy, everything is much easier! This iteration is already implemented inside the library (in C). So, to add the elements of an array A and an array B, you just have to write A + B.

NumPy Broadcasting: Rules and Examples
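To make the comparison concrete, here is a sketch that computes the same sum twice: once with explicit loops, as you would in a lower-level language, and once with the NumPy operator. The arrays are arbitrary small examples.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)  # [[0 1 2] [3 4 5]]
B = np.ones((2, 3))

# Loop version: visit every element by hand.
looped = np.empty((2, 3))
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        looped[i, j] = A[i, j] + B[i, j]

# NumPy version: the loop runs inside the library.
vectorized = A + B

print(np.array_equal(looped, vectorized))  # True
```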

To perform a mathematical operation on 2 NumPy arrays, the rule is simple: they must have the same shape, and if they do not, broadcasting can stretch any dimension of size 1 to match the corresponding dimension of the other array. Shapes are compared starting from the last dimension. Here are some examples:

A = np.ones((2, 3))
B = 3
print(A+B) # result 1
 
A = np.ones((2, 3))
B = np.ones((2, 1)) # B has one column, it will be spread over the three columns of A
print(A+B) # result 2
 
A = np.ones((2, 3))
B = np.ones((2, 2))
 
print(A + B) # ERROR! Shapes (2, 3) and (2, 2) cannot be broadcast together (raises ValueError)

Be careful! Broadcasting often brings bad surprises when you forget to give your arrays explicit shapes: write (3, 1) rather than (3,). For example, adding an array A of shape (3, 1) to an array of shape (3,) will not give a result of shape (3, 1), but a result of shape (3, 3).
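This trap is easy to reproduce. A minimal sketch, with illustrative variable names, showing the surprising shape and one way to avoid it by reshaping explicitly:

```python
import numpy as np

A = np.ones((3, 1))  # explicit column, shape (3, 1)
b = np.ones(3)       # shape (3,): NumPy treats it like a row of 3 values

print((A + b).shape)                # (3, 3): both arrays were stretched
print((A + b.reshape(3, 1)).shape)  # (3, 1): shapes match, no surprise
```

Printing the shape of your arrays before an operation is a cheap way to catch this kind of bug early.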

NumPy : Review

NumPy is the fundamental package for scientific computing with Python. All the other important packages in Machine Learning (Pandas, Matplotlib, Sklearn) are built on top of NumPy. It is essential to master the basic functions presented in this article to be able to generate arrays, manipulate them (according to their rows and columns) and perform some mathematical operations on them. It is not useful for a beginner to learn by heart functions other than those presented in this article (however, many educational websites drown their students in all the details of NumPy...)

If you are now familiar with NumPy, I recommend that you turn to Matplotlib, the best package to create charts in Python (and to visualize NumPy arrays!).

I'll answer any question asked in comments, enjoy!

Data World