# NumPy

## What is NumPy?

NumPy is a Python library built around the concept of arrays, which are collections of elements. The elements of a NumPy array are usually (but not necessarily) numbers, and NumPy allows you to perform calculations with those numbers.

Everything that you can do with NumPy, you can also do with built-in Python objects (notably `list`s, see iterables). But NumPy is much faster and, in many cases, more elegant.

NumPy is not part of the Python standard library. However, it is such a widely used library that it is pre-installed in most Python environments.

## Arrays

### Dimensions

NumPy arrays are called `ndarray` objects. The `nd` stands for "N-dimensional", which is indicates that an array can have any numbers of dimensions. A 1D array is like a `list`, because you need only a single coordinate (or index) to indicate an element. A 2D array is like a spreadsheet, because you need two coordinates (a row and a column, or an X and a Y coordinate) to indicate an element. Higher-dimensional arrays are also possible, although arrays of more than three dimensions are difficult to mentally picture. (But this is a limitation of your brain, not of NumPy!)

A real-life example of a 5D array might be an fMRI dataset (a brain scan) of multiple participants. Here, the first three dimensions would be space (X, Y, Z), because a brain is a 3D object. The fourth dimension would time, because an fMRI scan measures brain activity over time. And the fifth dimension would be the participant, because there are multiple participants. Therefore, you need five numbers to identify an element in this dataset (X, Y, Z, time, participant number).

### Data types

A NumPy array has a single data type (its `dtype`), and all elements in the array are of that type. This is different from a `list` object, which can contain elements of different types. The most common data types for NumPy arrays are `bool`, `int`, `float`, and `str`.

## Creating an array

First, we're going to import `numpy` with the alias `np` (see modules). The `np` alias is by convention, and has the advantage that it requires less typing than `numpy`.

```import numpy as np
```

You can create an array directly from a `list`. If you pass a `list` of `list`s, as in the example below, the array will become 2D; you can tell this from the shape property, which returns the size of each dimension. The length of an array is the size of the first dimension. The `dtype` of the array will also be automatically determined, unless you explicitly specify this with the `dtype` keyword. In this case, it's an `int` array (or `int64`, ).

```a = np.array([[1, 2, 3], [4, 5, 6]])
print(a)
print('dtype: {}'.format(a.dtype))
print('shape: {}'.format(a.shape))
print('length: {}'.format(len(a)))
```

Output:

```[[1 2 3]
[4 5 6]]
dtype: int64
shape: (2, 3)
length: 2
```

If you want an array of only zeros, you can use `np.zeros()`. If you want an array of only ones, you can use `np.ones()`. The only required argument is the desired shape.

```print('zeros:')
print(np.zeros((2, 3)))
print('ones:')
print(np.ones((2, 3)))
```

Output:

```zeros:
[[0. 0. 0.]
[0. 0. 0.]]
ones:
[[1. 1. 1.]
[1. 1. 1.]]
```

The NumPy equivalent of `range()` is `np.arange()`. `np.linspace()` is similar to `np.arange()` but differs in its arguments.

```print(np.arange(start=0, stop=2, step=0.5))
print(np.linspace(start=0, stop=1.5, num=4))
```

Output:

```[0.  0.5 1.  1.5]
[0.  0.5 1.  1.5]
```

And finally, the `np.random` module has functions for generating random numbers. For example, `np.random.random()` generates random `float`s between 0 and 1:

```print(np.random.random((2, 3)))
```

Output:

```[[0.22355768 0.23136636 0.3226043 ]
[0.62453122 0.71399482 0.48603747]]
```

## Indexing and slicing

Let's start with a 2D array of numbers. First create a 1D array with `np.arange()` and then simply change its shape (this is allowed as long as the shape is compatible with the number of elements in the array).

```a = np.arange(9)
a.shape = 3, 3
print(a)
```

Output:

```[[0 1 2]
[3 4 5]
[6 7 8]]
```

To get a single number, specify the row (1) and column (2), as always starting from 0:

```print(a[1, 2])
```

Output:

```5
```

To get a single row, specify the index of that row:

```print(a)
```

Output:

```[3 4 5]
```

To get multiple rows, specify a slice:

```print(a[1:3])
```

Output:

```[[3 4 5]
[6 7 8]]
```

To get a column, get the full slice of rows (`:`), and then specify the column index: (The column now looks like a row, because the result is 1D!)

```print(a[:, 1])
```

Output:

```[1 4 7]
```

Or to get multiple columns:

```print(a[:, 1:3])
```

Output:

```[[1 2]
[4 5]
[7 8]]
```

## Selecting elements

Consider the expression (`a > 2`):

```print(a)
print('Larger than 2?')
print(a > 2)
```

Output:

```[[0 1 2]
[3 4 5]
[6 7 8]]
Larger than 2?
[[False False False]
[ True  True  True]
[ True  True  True]]
```

For each element of the array, NumPy checks whether the value is larger than 2. The result is an array of `bool` values. This `bool` array is in itself not that useful, but you can use it to index the original array, like so:

```print(a[a > 2])
```

Output:

```[3 4 5 6 7 8]
```

You now get an array of elements that are larger than 2. This array is only 1D, even though the original array was 2D. This is because you have 'shot holes' in the array, and NumPy deals with this by flattening the resulting selection to a 1D array.

You can also select based on multiple criteria. The normal Python operators `and` and `or` don't work in this context! Instead, you have to use `&` (and) and `|` (or). And you have to put parentheses around the criteria. Like so:

```print(a[(a > 2) & (a < 7)])
```

Output:

```[3 4 5 6]
```

## Mathematical operations

If you apply the standard mathematical operators (`*`, `/`, `+`, `-`, `%`, `**`, `//`) to an array, then this operator is applied to each element separately, resulting in a new array. You can even apply multiple operators in a single expression:

```print(a)
print('a * 2 - 2:')
print(a * 2 - 2)
```

Output:

```[[0 1 2]
[3 4 5]
[6 7 8]]
a * 2 - 2:
[[-2  0  2]
[ 4  6  8]
[10 12 14]]
```

NumPy has many built-in functions for applying mathematical operations to arrays, such as `np.sqrt()`, `np.exp()`, `np.log()`, etc.

```print(a)
print('sqrt(a):')
print(np.sqrt(a))
```

Output:

```[[0 1 2]
[3 4 5]
[6 7 8]]
sqrt(a):
[[0.         1.         1.41421356]
[1.73205081 2.         2.23606798]
[2.44948974 2.64575131 2.82842712]]
```

## Mean, median, and standard deviation

You can easily get the mean, median, and standard deviation of an array:

```print(a)
print('mean = {}\nmedian = {}\nstd = {}'.format(
np.mean(a),
np.median(a),
np.std(a)
))
```

Output:

```[[0 1 2]
[3 4 5]
[6 7 8]]
mean = 4.0
median = 4.0
std = 2.581988897471611
```

If you have a multidimensional array, then you can specify the axis across which the mean should be calculated. You can even specify multiple axes; this sounds a bit abstract, but you can think of it as specifying which dimensions should be averaged out. Therefore, specifying all axes is the same as taking the mean across the entire array.

```a = np.array([
[1, 2, 3],
[4, 5, 6]
])
print(a)
print('mean(across first axis) = {}'.format(a.mean(axis=0)))
print('mean(across second axis) = {}'.format(a.mean(axis=1)))
print('mean(across all axes) = {}'.format(a.mean(axis=(0, 1))))
```

Output:

```[[1 2 3]
[4 5 6]]
mean(across first axis) = [2.5 3.5 4.5]
mean(across second axis) = [2. 5.]
mean(across all axes) = 3.5
```

If an array contains `nan` values (see syntax), then each of the above functions will also return `nan`. This is because of the mathematical convention that every operation that involves a `nan` evaluates to `nan`. However, this behavior is often not what you want, and therefore NumPy contains functions that explicitly ignore `nan` values:

```a = np.linspace(0, 10, 5)
a = np.nan
print(a)
print('mean = {}\nmedian = {}\nstd = {}'.format(
np.mean(a),
np.median(a),
np.std(a)
))
print('nanmean = {}\nnanmedian = {}\nnanstd = {}'.format(
np.nanmean(a),
np.nanmedian(a),
np.nanstd(a)
))
```

## Exercises

### Removing extreme values

• Generate an array of a thousand random numbers
• Print out the shape of the array, and the mean and standard deviation of these numbers
• Remove all numbers that are more than one standard deviation above or below the mean
• Again, print out the shape of the array, and the mean and standard deviation of the remaining numbers

View solution

### Activity in the left vs right brain

• Read this dataset, which has been adapted from the StudyForrest project. It is an `.npy` file, which you can read with `np.load()`.
• This will give you a 4D array with the following dimensions: left-to-right, back-to-front, bottom-to-top, time (1 sample per 2 s). Values in the array correspond to BOLD values (an indirect measure of brain activity).
• First, inspect the shape of the array to see how big each dimension is.
• Next, print out the average BOLD value separately for the left and the right side of the brain!

View solution