Numpy
Contents
Numpy¶
For technical reasons, pure Python is not fast enough for numerical data analysis. To work around this problem various packages were developed. Most packages are based on the API, which was developed and implemented by Numpy. Therefore we are concentrating in particular on numpy.
Numpy is a package to compute efficiently on multi-dimensional arrays (tensors) with Python. It offers various features, some of them listed below:
types:
numeric data types of different precisions
ndarray: multi-dimensional array of a particular type.
masked Arrays: array that may have missing values, e.g. for missing data or land masks
functions and methods:
ufunc: operate element wise on arrays (+, -, sin, cos, …)
ndarray methoden: e.g. mean, sum, max, transpose, dot …
linear algebra (matrix inversion, etc.)
…
Before we can use numpy, we have to import it.
import numpy as np
Ndarray¶
The ndarray class is the foundation of numpy. A ndarray object represents a multidimensional and homogeneous array consisting of elements of fixed size. This means that all elements are of the same type.
Creating Arrays¶
To create an ndarray object, the following functions are used:
a1 = np.array([1, 2, 3])
a2 = np.array([1., 2, 3])
# Array containing only zeros
z = np.zeros((3, 2))
# Array containing only ones
o = np.ones((5, 3))
# Array containing only zeros with the same metadata as o
zeros_from_ones = np.zeros_like(o)
# Array without initial values
buffer = np.empty((3, 2))
# 10x10 identity matrix
a = np.eye(10)
# As range(), but also for floats
x1 = np.arange(1, 10, 0.1)
x2 = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x1)
Note
Arrays can also be initialized via a sequence, such as a list or a tuple, or other iterable types, e.g. a range.
print(np.array(range(3)))
[0 1 2]
Attributes¶
Arrays have attributes which contain its meta data. These attributes are:
ndim
: Number of dimensionsshape
: tuple of the length of the dimensionssize
: number of all elementsdtype
: data type of all elementsitemsize
: size of the elements in bytes
a = np.array([[2, 3, 1], [5, 7, 3]])
print(a)
print(a.ndim)
print(a.shape)
print(a.size)
print(a.dtype)
print(a.itemsize)
[[2 3 1]
[5 7 3]]
2
(2, 3)
6
int64
8
Indexing and Slicing¶
Indexing and slicing works in the same way as it does for sequences.
a = np.array(range(9), dtype=np.int)
print(a)
print(a[4])
print(a[1:3])
print(a[::-1])
[0 1 2 3 4 5 6 7 8]
4
[1 2]
[8 7 6 5 4 3 2 1 0]
Numpy array are mutable
which means that we can change elements of an array after it has been created.
Note
When changing an element of an array, the data will be casted to the type of the array. This may cause loss of information as shown below.
a = np.array(range(9))
print(a)
a[5] = 3.4
print(a)
[0 1 2 3 4 5 6 7 8]
[0 1 2 3 4 3 6 7 8]
Warning
Unlike for lists, slices of an numpy array are references to the slice of the original array. Hence changing data of a slice also changes the original array!
a = np.array(range(9))
b = a[1:2]
b[0] = 20
print(b)
print(a)
[20]
[ 0 20 2 3 4 5 6 7 8]
Multi-dimensional arrays may have one index per axis. Tuples of indices are automatically filled with :
to the right.
a = np.random.random((2, 3, 4))
print(a)
print(a[1, 2, 1])
print(a[:, 1, 1])
print(a[1:, :, :])
print(a[-1])
[[[0.75927542 0.67008731 0.06948295 0.3219854 ]
[0.8164175 0.89856409 0.48034308 0.85270738]
[0.13092369 0.29600006 0.6499389 0.10710406]]
[[0.2234761 0.95906661 0.68333508 0.70382561]
[0.59257239 0.62352223 0.95006678 0.86956172]
[0.06296639 0.27606878 0.12782001 0.62795607]]]
0.2760687789695674
[0.89856409 0.62352223]
[[[0.2234761 0.95906661 0.68333508 0.70382561]
[0.59257239 0.62352223 0.95006678 0.86956172]
[0.06296639 0.27606878 0.12782001 0.62795607]]]
[[0.2234761 0.95906661 0.68333508 0.70382561]
[0.59257239 0.62352223 0.95006678 0.86956172]
[0.06296639 0.27606878 0.12782001 0.62795607]]
Sometimes we may want to write code that does not assume a particular number of dimensions for the arrays.
Here, the ...
notation is useful. It is used as an dummy for as many :
as necessary to fill the index tuple.
print(a[..., 0])
print(a[:2, ..., 1])
[[0.75927542 0.8164175 0.13092369]
[0.2234761 0.59257239 0.06296639]]
[[0.67008731 0.89856409 0.29600006]
[0.95906661 0.62352223 0.27606878]]
x = np.arange(2 * 3 * 2 * 5 * 4).reshape(2, 3, 2, 5, 4)
print((x[1, 2, ...] == x[1, 2, :, :, :]).all())
print((x[..., 3] == x[:, :, :, :, 3]).all())
print((x[1, ..., 4, :] == x[1, :, :, 4, :]).all())
True
True
True
Iterating over arrays¶
We can iterate over the elements of an one-dimensional array in the same way as we do for lists.
a = np.arange(10) ** 3
for e in a:
print(e ** (1 / 3.))
0.0
1.0
2.0
3.0
3.9999999999999996
4.999999999999999
5.999999999999999
6.999999999999999
7.999999999999999
8.999999999999998
Multi-dimensional arrays are iterated over with respect to the first axis.
If we are iterating over an array of shape (2, 3)
, the for
loop will consist of two iterations and the variable in within the loop body will be an array of shape (3,)
.
a = np.arange(5 * 3).reshape((3, 5))
for row in a:
print(row)
[0 1 2 3 4]
[5 6 7 8 9]
[10 11 12 13 14]
To iterate over all elements of an array, the attribute flat
can be used.
for element in a.flat:
print(element)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
Reshape¶
Sometimes it is useful to change the shape of the array. Therefore numpy offers the following options:
The
shape
attribute of the array can be overwritten.transpose
returns the transposed array. The order of the axis is reversed.reshape returns a reference to the same data having its shape changed.
resize changes the size of the array itself and returns
None
.
Note
When changing the shape of an array, the new shape must be compatible with the old one. That is, there must fit the same number of elements in both shapes.
If the new shape is not compatible, a ValueError
is raised.
a = np.arange(3 * 4).reshape((3, 4))
print(a)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
a.shape = (3, 2, 1, 2)
print(a)
[[[[ 0 1]]
[[ 2 3]]]
[[[ 4 5]]
[[ 6 7]]]
[[[ 8 9]]
[[10 11]]]]
print(a.transpose())
[[[[ 0 4 8]
[ 2 6 10]]]
[[[ 1 5 9]
[ 3 7 11]]]]
a.resize((2, 6))
print(a)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]]
b = a.reshape((3, -1))
print(b.shape)
(3, 4)
# b references to the same data as a!
b[2, 2] = 100
print(a)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 100 11]]
Operators¶
Numpy offers the same arithmetic and logic operators which we already know from the basic python data types. These operators always work element wise.
a = np.arange(4)
b = np.array([2, 7, 4.5, 1])
print(a - b)
[-2. -6. -2.5 2. ]
print(a * b)
[0. 7. 9. 3.]
print(b ** 2)
[ 4. 49. 20.25 1. ]
print(b ** a)
[ 1. 7. 20.25 1. ]
print(a % 3)
[0 1 2 0]
print(10 * np.sin(a))
[0. 8.41470985 9.09297427 1.41120008]
print(b <= 4)
[ True False False True]
# in-place addition
b += a
print(b)
[2. 8. 6.5 4. ]
In order to be able to perform element wise operations between two arrays, the shapes of these arrays must be compatible.
If they are not compatible, a ValueError
is raised.
a = np.arange(3)
b = np.arange(2)
a + b
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-68-d99fc880d022> in <module>
1 a = np.arange(3)
2 b = np.arange(2)
----> 3 a + b
ValueError: operands could not be broadcast together with shapes (3,) (2,)
Most unary operators (operators which have only one argument) are implemented as a method of the ndarray class.
In many cases it is possible to apply the operators along one or more specific axis via the axis
keyword.
a = np.array([[1, 2], [3, 4], [5, 6]])
print(a)
[[1 2]
[3 4]
[5 6]]
# sum of all elements
print(a.sum())
21
# sum along the first dimension
print(a.sum(axis=0))
[ 9 12]
# global minimum
print(a.min())
1
# global maximum
print(a.max())
6
# mean
print(a.mean())
3.5
# cummultative sum along the second dimension
print(a.cumsum(axis=1))
[[ 1 3]
[ 3 7]
[ 5 11]]
# dot product (actually a binary operator)
b = np.arange(2)
print(a.dot(b), np.dot(a, b))
[2 4 6] [2 4 6]
Masked arrays¶
MaskedArray is a subclass of ndarray.
It extends the ndarray class by the possibility to mask elements. These are neglected in most of the computations.
Functions which create or deal with masked arrays are available in the numpy.ma
module.
For most of the functions that create ndarrays there exist a counterpart for MaskedArrays.
a = np.ma.ones((2, 2))
Elements can be masked by assigning the special value numpy.ma.masked
to them.
# mask a value
a[0, 1] = np.ma.masked
print(a)
print(a.sum())
[[1.0 --]
[1.0 1.0]]
3.0
The mask is available as an attribute of the MaskedArray object.
It is an array of the same shape of type boolean.
Unmasking of masked values can be done by assigning False
to elements of the mask.
a *= 2
print(a)
a.mask[0, 1] = False
print(a)
[[4.0 2.0]
[4.0 4.0]]
[[4.0 2.0]
[4.0 4.0]]
Note
By masking a value the actual element is not changed but only marked as masked. Element wise operations on a masked array only affect unmasked elements.
Numpy functions¶
Numpy offers a huge set of functions covering almost all fundamental computational demands in science. Many of these functions are grouped in modules within the numpy package. The following list contains links to the official documentation. Take some time to see what is offered. There is a solution for many problems you will have to solve in the future.
Note
The better you know the numpy library the easier it will be for you to write scientific software, e.g. for analyzing and visualizing data.
Broadcasting¶
Numpy functions or element wise operations expect arrays of the compatible shape. In many cases we have data which does not have the same shape. Consider for example a sea surface temperature field which has the shape (nlat, nlon)
, with the corresponding longitude vector of shape (nlon,)
and latitude vector of shape (nlat,)
. Nevertheless we want to be able to do things like the following
mean_sst = (sst * np.cos(lat)).mean()
To enable such computations, numpy
offers the so-called broadcasting.
It follows these rules:
If the input arrays do not have the same number of dimensions, a “1” is appended at the beginning of the shape of all arrays with less dimensions until they have the same rank.
The dimensions with length “1” are treated as if they were as long as the corresponding dimension of the largest array.
sst = np.arange(10*20).reshape(10, 20)
lat = np.linspace(-np.pi/2, np.pi/2, 20)
mean_sst = (sst * np.cos(lat)).mean()
Because new dimensions are always appended at the beginning of the shape, it can happen, that one has to append a new dimension by hand. This can be achieved by indexing via the special constant newaxis
.
a = np.arange(3)
print(a.shape, a[:, np.newaxis].shape)
(3,) (3, 1)
Here is a more complex example:
a = np.arange(3 * 4 * 7).reshape(3, 4, 7)
b = np.arange(7)
c = np.arange(4)
d = np.arange(3 * 7).reshape((3, 7))
print(a.shape, b.shape, c.shape, d.shape)
print(a * b)
(3, 4, 7) (7,) (4,) (3, 7)
[[[ 0 1 4 9 16 25 36]
[ 0 8 18 30 44 60 78]
[ 0 15 32 51 72 95 120]
[ 0 22 46 72 100 130 162]]
[[ 0 29 60 93 128 165 204]
[ 0 36 74 114 156 200 246]
[ 0 43 88 135 184 235 288]
[ 0 50 102 156 212 270 330]]
[[ 0 57 116 177 240 305 372]
[ 0 64 130 198 268 340 414]
[ 0 71 144 219 296 375 456]
[ 0 78 158 240 324 410 498]]]
# This does not work!
print(a * c)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-108-f6d9bab54ad1> in <module>
1 # This does not work!
----> 2 print(a * c)
ValueError: operands could not be broadcast together with shapes (3,4,7) (4,)
# ...but this does!
print(a * c[:, np.newaxis])
[[[ 0 0 0 0 0 0 0]
[ 7 8 9 10 11 12 13]
[ 28 30 32 34 36 38 40]
[ 63 66 69 72 75 78 81]]
[[ 0 0 0 0 0 0 0]
[ 35 36 37 38 39 40 41]
[ 84 86 88 90 92 94 96]
[147 150 153 156 159 162 165]]
[[ 0 0 0 0 0 0 0]
[ 63 64 65 66 67 68 69]
[140 142 144 146 148 150 152]
[231 234 237 240 243 246 249]]]
print(a * b * c[:, np.newaxis])
[[[ 0 0 0 0 0 0 0]
[ 0 8 18 30 44 60 78]
[ 0 30 64 102 144 190 240]
[ 0 66 138 216 300 390 486]]
[[ 0 0 0 0 0 0 0]
[ 0 36 74 114 156 200 246]
[ 0 86 176 270 368 470 576]
[ 0 150 306 468 636 810 990]]
[[ 0 0 0 0 0 0 0]
[ 0 64 130 198 268 340 414]
[ 0 142 288 438 592 750 912]
[ 0 234 474 720 972 1230 1494]]]
print(a * b * c[:, np.newaxis] * d[:, np.newaxis, :])
[[[ 0 0 0 0 0 0 0]
[ 0 8 36 90 176 300 468]
[ 0 30 128 306 576 950 1440]
[ 0 66 276 648 1200 1950 2916]]
[[ 0 0 0 0 0 0 0]
[ 0 288 666 1140 1716 2400 3198]
[ 0 688 1584 2700 4048 5640 7488]
[ 0 1200 2754 4680 6996 9720 12870]]
[[ 0 0 0 0 0 0 0]
[ 0 960 2080 3366 4824 6460 8280]
[ 0 2130 4608 7446 10656 14250 18240]
[ 0 3510 7584 12240 17496 23370 29880]]]