Round 2.1: Array Reductions

Nov 6, 2012 • Chang She

Array reductions are an extremely useful class of array operations that compute aggregate statistics over an array. Numpy ndarrays come with methods like ‘sum’, ‘mean’, ‘std’, and many other reductions that are convenient to use. Passing an argument for the ‘axis’ parameter computes the reduction along the given dimension, while omitting the ‘axis’ parameter computes the reduction over all elements in the array.

Prerequisite Knowledge

We assume the user knows basic statistical operations like sum, mean, and standard deviation. The user should also already know how to import numpy, create Numpy ndarrays, and index them. For the testing part of the examples, the user needs to know what a for-loop is.

Example

Suppose we’re conducting a scientific experiment measuring repeated samples of several variables. We can record the data in a 2-dimensional array where each row represents one sample of all variables and each column represents all samples of one variable:

In [52]: import numpy as np

In [53]: arr = np.array([[0, -0.5, 1.2], [1, 0.75, -2.], [0, 0.3, 4], [1, -0.1, -3.]])

In [54]: arr
Out[54]: 
array([[ 0.  , -0.5 ,  1.2 ],
       [ 1.  ,  0.75, -2.  ],
       [ 0.  ,  0.3 ,  4.  ],
       [ 1.  , -0.1 , -3.  ]])

As you can see, we have taken 4 samples for each of our 3 variables.

We can compute the mean and standard deviation of all samples for each variable:

In [55]: arr.mean(axis=0)
Out[55]: array([ 0.5   ,  0.1125,  0.05  ])

In [56]: arr.std(axis=0)
Out[56]: array([ 0.5       ,  0.46418612,  2.75816968])

Let’s spot check our results for mean by computing the mean for column 1 (second variable) explicitly using a for-loop:

In [63]: column1 = arr[:, 1]

In [64]: total = 0

In [65]: for x in column1:
   ....:     total = total + x
   ....:     

In [66]: total / len(column1)
Out[66]: 0.11250000000000002

The result matches the second entry in the array returned by “arr.mean(axis=0)”, up to floating-point rounding.
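The same kind of spot check can be applied to the standard deviation. The sketch below is one way to do it, assuming the default behaviour of “std”, which divides by the number of samples (the population standard deviation). The names total, mean1, squared_dev, and std1 are introduced here for illustration only.

```python
import math
import numpy as np

arr = np.array([[0, -0.5, 1.2], [1, 0.75, -2.], [0, 0.3, 4], [1, -0.1, -3.]])
column1 = arr[:, 1]

# Mean of column 1, computed with an explicit loop
total = 0.0
for x in column1:
    total += x
mean1 = total / len(column1)

# Population standard deviation: square root of the mean squared deviation
squared_dev = 0.0
for x in column1:
    squared_dev += (x - mean1) ** 2
std1 = math.sqrt(squared_dev / len(column1))

# Compare against the second entry of arr.std(axis=0)
print(std1)
print(np.isclose(std1, arr.std(axis=0)[1]))
```

The comparison uses np.isclose rather than == because the loop and the built-in reduction may round slightly differently.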

Suppose the variables are all parts of an integrated whole. It might be interesting to see the total value for each observation:

In [57]: arr.sum(axis=1)
Out[57]: array([ 0.7 , -0.25,  4.3 , -2.1 ])

Here the return value has 4 entries because we have taken 4 samples for all the variables.

If we omit the ‘axis’ parameter then the computation is performed on the entire array:

In [58]: arr.sum()
Out[58]: 2.6500000000000004
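To summarize how the ‘axis’ parameter affects the result, the sketch below compares the shapes of the three sums for our 4-by-3 array. The names col_sums, row_sums, and total are introduced here for illustration only.

```python
import numpy as np

arr = np.array([[0, -0.5, 1.2], [1, 0.75, -2.], [0, 0.3, 4], [1, -0.1, -3.]])

col_sums = arr.sum(axis=0)  # collapses the rows: one entry per variable
row_sums = arr.sum(axis=1)  # collapses the columns: one entry per sample
total = arr.sum()           # collapses everything: a single number

print(arr.shape)        # (4, 3)
print(col_sums.shape)   # (3,)
print(row_sums.shape)   # (4,)
print(np.ndim(total))   # 0

# Summing either partial result again should give the overall total
print(np.isclose(col_sums.sum(), total))
print(np.isclose(row_sums.sum(), total))
```

The axis we pass is the one that disappears from the shape of the result.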

###

Q1. How do I test whether “arr.sum()” returns the correct value?
A1.

total = 0
for row in arr:
    for x in row:
        total += x
# total should match arr.sum(), up to floating-point rounding

Q2. What is the mean value across all variables for each observation?
A2. arr.mean(axis=1)

Q3. What is the variance of each variable?
A3. arr.var(axis=0)

Extra Credit: Compute the sample standard deviation for each variable.
Answer: arr.std(axis=0, ddof=1)
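To get one value per variable, the call needs axis=0 along with ddof=1. Without ‘axis’, the reduction runs over every element of the array. The sketch below illustrates what ‘ddof’ changes: the divisor becomes N - ddof instead of N, so the sample standard deviation is the population value rescaled by sqrt(N / (N - 1)). The names n, population_std, sample_std, and rescaled are introduced here for illustration only.

```python
import numpy as np

arr = np.array([[0, -0.5, 1.2], [1, 0.75, -2.], [0, 0.3, 4], [1, -0.1, -3.]])

n = arr.shape[0]  # number of samples per variable

population_std = arr.std(axis=0)          # divides by n (the default, ddof=0)
sample_std = arr.std(axis=0, ddof=1)      # divides by n - 1

# Rescaling the population value should reproduce the sample value
rescaled = population_std * np.sqrt(n / (n - 1.0))

print(sample_std.shape)                   # (3,)
print(np.allclose(sample_std, rescaled))
```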

 

Time spent

1 hour on examples and diagrams
1 hour on reading