Numpy

For technical reasons, pure Python is not fast enough for numerical data analysis. To work around this problem various packages were developed. Most packages are based on the API, which was developed and implemented by Numpy. Therefore we are concentrating in particular on numpy.

Numpy is a package to compute efficiently on multi-dimensional arrays (tensors) with Python. It offers various features, some of them listed below:

  • types:

    • numeric data types of different precisions

    • ndarray: multi-dimensional array of a particular type.

    • masked Arrays: array that may have missing values, e.g. for missing data or land masks

  • functions and methods:

    • ufunc: operate element wise on arrays (+, -, sin, cos, …)

    • ndarray methoden: e.g. mean, sum, max, transpose, dot …

    • linear algebra (matrix inversion, etc.)

Before we can use numpy, we have to import it.

import numpy as np

Ndarray

The ndarray class is the foundation of numpy. A ndarray object represents a multidimensional and homogeneous array consisting of elements of fixed size. This means that all elements are of the same type.

Creating Arrays

To create an ndarray object, the following functions are used:

a1 = np.array([1, 2, 3])

a2 = np.array([1., 2, 3])

# Array containing only zeros
z = np.zeros((3, 2))

# Array containing only ones
o = np.ones((5, 3))

# Array containing only zeros with the same metadata as o
zeros_from_ones = np.zeros_like(o)

# Array without initial values
buffer = np.empty((3, 2))

# 10x10 identity matrix
a = np.eye(10)

# As range(), but also for floats
x1 = np.arange(1, 10, 0.1)

x2 = np.linspace(0, 2 * np.pi, 100)

y = np.sin(x1)

Note

Arrays can also be initialized via a sequence, such as a list or a tuple, or other iterable types, e.g. a range.

print(np.array(range(3)))
[0 1 2]

Attributes

Arrays have attributes which contain its meta data. These attributes are:

  • ndim: Number of dimensions

  • shape: tuple of the length of the dimensions

  • size: number of all elements

  • dtype: data type of all elements

  • itemsize: size of the elements in bytes

a = np.array([[2, 3, 1], [5, 7, 3]])

print(a)
print(a.ndim)
print(a.shape)
print(a.size)
print(a.dtype)
print(a.itemsize)
[[2 3 1]
 [5 7 3]]
2
(2, 3)
6
int64
8

Indexing and Slicing

Indexing and slicing works in the same way as it does for sequences.

a = np.array(range(9), dtype=np.int)
print(a)
print(a[4])
print(a[1:3])
print(a[::-1])
[0 1 2 3 4 5 6 7 8]
4
[1 2]
[8 7 6 5 4 3 2 1 0]

Numpy array are mutable which means that we can change elements of an array after it has been created.

Note

When changing an element of an array, the data will be casted to the type of the array. This may cause loss of information as shown below.

a = np.array(range(9))
print(a)

a[5] = 3.4
print(a)
[0 1 2 3 4 5 6 7 8]
[0 1 2 3 4 3 6 7 8]

Warning

Unlike for lists, slices of an numpy array are references to the slice of the original array. Hence changing data of a slice also changes the original array!

a = np.array(range(9))

b = a[1:2]
b[0] = 20

print(b)
print(a)
[20]
[ 0 20  2  3  4  5  6  7  8]

Multi-dimensional arrays may have one index per axis. Tuples of indices are automatically filled with : to the right.

a = np.random.random((2, 3, 4))

print(a)
print(a[1, 2, 1])
print(a[:, 1, 1])
print(a[1:, :, :])

print(a[-1])
[[[0.75927542 0.67008731 0.06948295 0.3219854 ]
  [0.8164175  0.89856409 0.48034308 0.85270738]
  [0.13092369 0.29600006 0.6499389  0.10710406]]

 [[0.2234761  0.95906661 0.68333508 0.70382561]
  [0.59257239 0.62352223 0.95006678 0.86956172]
  [0.06296639 0.27606878 0.12782001 0.62795607]]]
0.2760687789695674
[0.89856409 0.62352223]
[[[0.2234761  0.95906661 0.68333508 0.70382561]
  [0.59257239 0.62352223 0.95006678 0.86956172]
  [0.06296639 0.27606878 0.12782001 0.62795607]]]
[[0.2234761  0.95906661 0.68333508 0.70382561]
 [0.59257239 0.62352223 0.95006678 0.86956172]
 [0.06296639 0.27606878 0.12782001 0.62795607]]

Sometimes we may want to write code that does not assume a particular number of dimensions for the arrays. Here, the ... notation is useful. It is used as an dummy for as many : as necessary to fill the index tuple.

print(a[..., 0])
print(a[:2, ..., 1])
[[0.75927542 0.8164175  0.13092369]
 [0.2234761  0.59257239 0.06296639]]
[[0.67008731 0.89856409 0.29600006]
 [0.95906661 0.62352223 0.27606878]]
x = np.arange(2 * 3 * 2 * 5 * 4).reshape(2, 3, 2, 5, 4)

print((x[1, 2, ...] == x[1, 2, :, :, :]).all())
print((x[..., 3] == x[:, :, :, :, 3]).all())
print((x[1, ..., 4, :] == x[1, :, :, 4, :]).all())
True
True
True

Iterating over arrays

We can iterate over the elements of an one-dimensional array in the same way as we do for lists.

a = np.arange(10) ** 3

for e in a:
    print(e ** (1 / 3.))
0.0
1.0
2.0
3.0
3.9999999999999996
4.999999999999999
5.999999999999999
6.999999999999999
7.999999999999999
8.999999999999998

Multi-dimensional arrays are iterated over with respect to the first axis. If we are iterating over an array of shape (2, 3), the for loop will consist of two iterations and the variable in within the loop body will be an array of shape (3,).

a = np.arange(5 * 3).reshape((3, 5))
for row in a:
    print(row)
[0 1 2 3 4]
[5 6 7 8 9]
[10 11 12 13 14]

To iterate over all elements of an array, the attribute flat can be used.

for element in a.flat:
    print(element)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14

Reshape

Sometimes it is useful to change the shape of the array. Therefore numpy offers the following options:

  • The shape attribute of the array can be overwritten.

  • transpose returns the transposed array. The order of the axis is reversed.

  • reshape returns a reference to the same data having its shape changed.

  • resize changes the size of the array itself and returns None.

Note

When changing the shape of an array, the new shape must be compatible with the old one. That is, there must fit the same number of elements in both shapes. If the new shape is not compatible, a ValueError is raised.

a = np.arange(3 * 4).reshape((3, 4))
print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
a.shape = (3, 2, 1, 2)
print(a)
[[[[ 0  1]]

  [[ 2  3]]]


 [[[ 4  5]]

  [[ 6  7]]]


 [[[ 8  9]]

  [[10 11]]]]
print(a.transpose())
[[[[ 0  4  8]
   [ 2  6 10]]]


 [[[ 1  5  9]
   [ 3  7 11]]]]
a.resize((2, 6))
print(a)
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]]
b = a.reshape((3, -1))
print(b.shape)
(3, 4)
# b references to the same data as a!
b[2, 2] = 100
print(a)
[[  0   1   2   3   4   5]
 [  6   7   8   9 100  11]]

Operators

Numpy offers the same arithmetic and logic operators which we already know from the basic python data types. These operators always work element wise.

a = np.arange(4)
b = np.array([2, 7, 4.5, 1])
print(a - b)
[-2.  -6.  -2.5  2. ]
print(a * b)
[0. 7. 9. 3.]
print(b ** 2)
[ 4.   49.   20.25  1.  ]
print(b ** a)
[ 1.    7.   20.25  1.  ]
print(a % 3)
[0 1 2 0]
print(10 * np.sin(a))
[0.         8.41470985 9.09297427 1.41120008]
print(b <= 4)
[ True False False  True]
# in-place addition
b += a
print(b)
[2.  8.  6.5 4. ]

In order to be able to perform element wise operations between two arrays, the shapes of these arrays must be compatible. If they are not compatible, a ValueError is raised.

a = np.arange(3)
b = np.arange(2)
a + b
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-d99fc880d022> in <module>
      1 a = np.arange(3)
      2 b = np.arange(2)
----> 3 a + b

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

Most unary operators (operators which have only one argument) are implemented as a method of the ndarray class. In many cases it is possible to apply the operators along one or more specific axis via the axis keyword.

a = np.array([[1, 2], [3, 4], [5, 6]])
print(a)
[[1 2]
 [3 4]
 [5 6]]
# sum of all elements
print(a.sum())
21
# sum along the first dimension
print(a.sum(axis=0))
[ 9 12]
# global minimum
print(a.min())
1
# global maximum
print(a.max())
6
# mean
print(a.mean())
3.5
# cummultative sum along the second dimension
print(a.cumsum(axis=1))
[[ 1  3]
 [ 3  7]
 [ 5 11]]
# dot product (actually a binary operator)
b = np.arange(2)
print(a.dot(b), np.dot(a, b))
[2 4 6] [2 4 6]

Masked arrays

MaskedArray is a subclass of ndarray. It extends the ndarray class by the possibility to mask elements. These are neglected in most of the computations. Functions which create or deal with masked arrays are available in the numpy.ma module. For most of the functions that create ndarrays there exist a counterpart for MaskedArrays.

a = np.ma.ones((2, 2))

Elements can be masked by assigning the special value numpy.ma.masked to them.

# mask a value
a[0, 1] = np.ma.masked
print(a)
print(a.sum())
[[1.0 --]
 [1.0 1.0]]
3.0

The mask is available as an attribute of the MaskedArray object. It is an array of the same shape of type boolean. Unmasking of masked values can be done by assigning False to elements of the mask.

a *= 2
print(a)

a.mask[0, 1] = False
print(a)
[[4.0 2.0]
 [4.0 4.0]]
[[4.0 2.0]
 [4.0 4.0]]

Note

By masking a value the actual element is not changed but only marked as masked. Element wise operations on a masked array only affect unmasked elements.

Numpy functions

Numpy offers a huge set of functions covering almost all fundamental computational demands in science. Many of these functions are grouped in modules within the numpy package. The following list contains links to the official documentation. Take some time to see what is offered. There is a solution for many problems you will have to solve in the future.

Note

The better you know the numpy library the easier it will be for you to write scientific software, e.g. for analyzing and visualizing data.

Broadcasting

Numpy functions or element wise operations expect arrays of the compatible shape. In many cases we have data which does not have the same shape. Consider for example a sea surface temperature field which has the shape (nlat, nlon), with the corresponding longitude vector of shape (nlon,) and latitude vector of shape (nlat,). Nevertheless we want to be able to do things like the following

mean_sst = (sst * np.cos(lat)).mean()

To enable such computations, numpy offers the so-called broadcasting. It follows these rules:

  • If the input arrays do not have the same number of dimensions, a “1” is appended at the beginning of the shape of all arrays with less dimensions until they have the same rank.

  • The dimensions with length “1” are treated as if they were as long as the corresponding dimension of the largest array.

sst = np.arange(10*20).reshape(10, 20)
lat = np.linspace(-np.pi/2, np.pi/2, 20)

mean_sst = (sst * np.cos(lat)).mean()

Because new dimensions are always appended at the beginning of the shape, it can happen, that one has to append a new dimension by hand. This can be achieved by indexing via the special constant newaxis.

a = np.arange(3)
print(a.shape, a[:, np.newaxis].shape)
(3,) (3, 1)

Here is a more complex example:

a = np.arange(3 * 4 * 7).reshape(3, 4, 7)
b = np.arange(7)
c = np.arange(4)
d = np.arange(3 * 7).reshape((3, 7))

print(a.shape, b.shape, c.shape, d.shape)
print(a * b)
(3, 4, 7) (7,) (4,) (3, 7)
[[[  0   1   4   9  16  25  36]
  [  0   8  18  30  44  60  78]
  [  0  15  32  51  72  95 120]
  [  0  22  46  72 100 130 162]]

 [[  0  29  60  93 128 165 204]
  [  0  36  74 114 156 200 246]
  [  0  43  88 135 184 235 288]
  [  0  50 102 156 212 270 330]]

 [[  0  57 116 177 240 305 372]
  [  0  64 130 198 268 340 414]
  [  0  71 144 219 296 375 456]
  [  0  78 158 240 324 410 498]]]
# This does not work!
print(a * c)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-108-f6d9bab54ad1> in <module>
      1 # This does not work!
----> 2 print(a * c)

ValueError: operands could not be broadcast together with shapes (3,4,7) (4,) 
# ...but this does!
print(a * c[:, np.newaxis])
[[[  0   0   0   0   0   0   0]
  [  7   8   9  10  11  12  13]
  [ 28  30  32  34  36  38  40]
  [ 63  66  69  72  75  78  81]]

 [[  0   0   0   0   0   0   0]
  [ 35  36  37  38  39  40  41]
  [ 84  86  88  90  92  94  96]
  [147 150 153 156 159 162 165]]

 [[  0   0   0   0   0   0   0]
  [ 63  64  65  66  67  68  69]
  [140 142 144 146 148 150 152]
  [231 234 237 240 243 246 249]]]
print(a * b  * c[:, np.newaxis])
[[[   0    0    0    0    0    0    0]
  [   0    8   18   30   44   60   78]
  [   0   30   64  102  144  190  240]
  [   0   66  138  216  300  390  486]]

 [[   0    0    0    0    0    0    0]
  [   0   36   74  114  156  200  246]
  [   0   86  176  270  368  470  576]
  [   0  150  306  468  636  810  990]]

 [[   0    0    0    0    0    0    0]
  [   0   64  130  198  268  340  414]
  [   0  142  288  438  592  750  912]
  [   0  234  474  720  972 1230 1494]]]
print(a * b  * c[:, np.newaxis] * d[:, np.newaxis, :])
[[[    0     0     0     0     0     0     0]
  [    0     8    36    90   176   300   468]
  [    0    30   128   306   576   950  1440]
  [    0    66   276   648  1200  1950  2916]]

 [[    0     0     0     0     0     0     0]
  [    0   288   666  1140  1716  2400  3198]
  [    0   688  1584  2700  4048  5640  7488]
  [    0  1200  2754  4680  6996  9720 12870]]

 [[    0     0     0     0     0     0     0]
  [    0   960  2080  3366  4824  6460  8280]
  [    0  2130  4608  7446 10656 14250 18240]
  [    0  3510  7584 12240 17496 23370 29880]]]