Numpy¶

Numpy (NUMerical PYthon) provides very useful arrays structure to work with data.

Arrays¶

Numpy's array is a generalization of list discussed in chapter1, it is more appropriate for the computation.

import numpy as np
weight=[58,89,68,74,62,77,65,65]
weight
weight_arr=np.array(weight)
weight_arr

It is like a vector, all elements should have same type, therefore if you add a strict element, it saves all elements as strict. It also accept multi lists

arr1=np.array([range(i) for i in [1, 2, 3]])
arr1[1]
arr1[1][0]
arr2=np.array([range(i, j+i) for i in [1, 2, 3] for j in [1, 2, 3]])
arr2[1]
arr2[1][0]

A constant array can be generated using np.full(,)

np.full(2, 2.2)
np.full((2,1), 2.2)
np.full((2,2), 2.2)

np.repeat(2.2, 2)
np.repeat([2.1,2.2],2)
np.repeat([2.1,2.2], [2, 3])

np.arange(1,14,4)
np.arange(21,30,3)
np.arange(2,1,-0.1)

To create an array of n values between two values

np.linspace(0, 1, 10)

To refer elements of array should use []

weight[1]# first element
weight[2:]# second elements to the rest
weight[:3]# elements before the third and including the third

To refer elements of multi array should use [,]

weight2=np.array([weight_arr,2.20*weight_arr,35.27*weight_arr])
weight2[1,1]
weight2[1:,1:]
weight2[1:,2:]

To change the shape

weight2.reshape((8, 3))

concatenate¶

It provides functions to concatenate the arrays

>>> w1 = weight_arr[:4]
>>> w2 = weight_arr[4:]
>>> np.concatenate((w1, w2), axis=0)
array([58, 89, 68, 74, 62, 77, 65, 65])
>>> w1r=w1.reshape(2,2)
>>> w2r=w2.reshape(2,2)
>>> np.concatenate((w1r,w2r), axis=0)
array([[58, 89],
       [68, 74],
       [62, 77],
       [65, 65]])
>>> np.vstack((w1r,w2r))
array([[58, 89],
       [68, 74],
       [62, 77],
       [65, 65]])
>>> np.concatenate((w1r,w2r), axis=1)
array([[58, 89, 62, 77],
       [68, 74, 65, 65]])
>>> np.hstack((w1r,w2r))
array([[58, 89, 62, 77],
       [68, 74, 65, 65]])

The array can be split into subsplit

w1, w2, w3,w3,w4 = np.split(weight_arr, 4)

Useful function¶

Numpy provides very useful bult-in function,

np.sort(weight2,axis=0)
np.sort(weight2,axis=1)
>>> np.argsort(weight2,axis=0)
array([[0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2]])
>>> np.argsort(weight2,axis=1)
array([[0, 4, 6, 7, 2, 3, 5, 1],
       [0, 4, 6, 7, 2, 3, 5, 1],
       [0, 4, 6, 7, 2, 3, 5, 1]])

>>> np.min(weight2)
58.0
>>> np.min(weight2,axis=1)
array([   58.  ,   127.6 ,  2045.66])
>>> np.min(weight2,axis=0)
array([ 58.,  89.,  68.,  74.,  62.,  77.,  65.,  65.])

The following list includes some useful functions:

Function Name	NaN-safe Version	Description
np.sum	np.nansum	Compute sum of elements
np.prod	np.nanprod	Compute product of elements
np.mean	np.nanmean	Compute mean of elements
np.std	np.nanstd	Compute standard deviation
np.var	np.nanvar	Compute variance
np.min	np.nanmin	Find minimum value
np.max	np.nanmax	Find maximum value
np.argmin	np.nanargmin	Find index of minimum value
np.argmax	np.nanargmax	Find index of maximum value
np.median	np.nanmedian	Compute median of elements
np.percentile	np.nanpercentile	Compute rank-based statistics of elements
np.any	N/A Evaluate	whether any elements are true
np.all	N/A Evaluate	whether all elements are true

The isnan is useful command to study if the object includes the NA or not.

>>> y=np.array([1, 2, np.nan])
>>> np.isnan(y)
array([False, False,  True], dtype=bool)

Matrix¶

Numpy has very strong functions for matrices.

mat=np.array([i for i in range(9)]).reshape(3,3)
mat.diagonal()
np.diagonal(mat)
mat.trace()
np.trace(mat)
mat.transpose() # transpose
np.transpose(mat)
mat2=np.array([i for i in range(6)]).reshape(3,2)
np.dot(mat,mat2) # dot product 
np.inner(mat,mat) # inner product

More matrical functions can be obtained from Linear algebra, numpy.linalg,

a=np.array([[1,2],[3,4]])
np.linalg.inv(a) # inverse
np.linalg.pinv(a) # inverse
np.linalg.det(a)

y = np.array([[5.], [7.]])
np.linalg.solve(a, y)
np.linalg.eig(a)

Instead generating matrix from array, numpy has numpy.matlib that directly can be used to generate matrix

import numpy.matlib as npmat
np.eye(3)
np.zeros((4,2))
np.ones((4,2))

Calculate the beta XTX = np.dot(X.T, X) INV = np.linalg.inv(XTX) beta = np.dot(np.matmul(INV, X.T), y) beta

Random number¶

Numpy has very strong function for generating random number from several statistical distributions.

np.random.rand(4,2) #Generate n random number from uniform

np.random.randn(4,2) #Generate n random number from standard normal

np.random.randint(low=1,high=20, size=10)

x = np.arange(12).reshape((4 ,3))
np.random.shuffle(x)

a0=['a','b']
a0=np.array(a0)
np.random.choice(a0,size=3, replace=True,p=(.3,.7))

Subsetting¶

The logical operator is often used to extract subset of data, using the array can easily achieve selecting subsets.

>>> x= np.array(["a", "b", "c"])
>>> y= np.array([3, 4, "c"])
>>> set(x).union(y) # union
{'c', '4', 'b', '3', 'a'}
>>> set(x).intersection(y)
{'c'}
>>> set(x)- set(y)
{'b', 'a'}
>>> [x[i]==y[i] for i in  range(len(x))]# the same as is.ellemt in R
[False, False, True]

>>> weight=[58,89,68,74,62,77,65,65]
>>> weight=np.array(weight)
>>> weight<74
array([ True, False,  True, False,  True, False,  True,True], dtype=bool)
>>> (weight<74) & (weight==89)
array([False, False, False, False, False, False, False, False], dtype=bool)
>>> weight[(weight<74) & (weight==89)]
array([], dtype=int64)
>>> weight[(weight<74) & (weight==62)]
array([62])
>>> weight[(weight<74) | (weight==62)]
array([58, 68, 62, 65, 65])
>>> weight[~(weight<74) & (weight==62)]
array([], dtype=int64)
>>> weight[~((weight<74) | (weight==62))]
array([89, 74, 77])

>>>weight2=np.array([weight_arr,2.20*weight_arr,35.27*weight_arr])
>>> np.sum(weight2 < 127, axis=1)
array([8, 0, 0])
>>> np.any(weight2 < 127)
True
>>> np.sum(weight2 < 127, axis=1)
array([8, 0, 0])

Combined data¶

The data set might include different data like dataframe in R that each column has data with different format, array can save such data.

>>>weight=[58,89,68,74,62,77,65,65]
>>>gender=['F','F','F','M','M','M','M','M']
>>> data = np.zeros(8, dtype={'names':('sex', 'weight'),'formats':('U10', 'f4')})
>>> data['sex']=gender
>>> data['weight']=weight
>>> data
array([('F',  58.), ('F',  89.), ('F',  68.), ('M',  74.), ('M',  62.),
       ('M',  77.), ('M',  65.), ('M',  65.)],
      dtype=[('sex', '<U10'), ('weight', '<f4')])

>>> data['weight']
array([ 58.,  89.,  68.,  74.,  62.,  77.,  65.,  65.], dtype=float32)
>>> data['weight'][1:3]
array([ 89.,  68.], dtype=float32)

When you define the array, you should define the format as well, data can be save with different format see, ?????

Character	Description	Example
'b'	Byte	np.dtype('b')
'i'	Signed integer	np.dtype('i4') == np.int32
'u'	Unsigned	integer
'f'	Floating point	np.dtype('f8') == np.int64
'c'	Complex floating point	np.dtype('c16') == np.complex128
'S', 'a'	String	np.dtype('S5')
'U'	Unicode string	np.dtype('U') == np.str_
'V'	Raw data (void)	np.dtype('V') == np.void

masked array¶

There are many circumstances where one should drop part of it, because of missing, invalid entries, or other reasons. Consider the weight and mask weight<70,

weight_arr=np.array(weight)
weight_m=np.array(np.zeros(len(weight)))
weight_m[weight_arr<70]=1
weight_ma1=np.ma.masked_array(weight_arr, mask=weight_m)
weight_ma1.data
weight_ma1.mask
np.mean(weight_ma1)
````

```{Python, echo = FALSE, message = FALSE}
weight_ma2=np.ma.masked_where(weight_arr<70,weight_arr)
weight_ma2
np.mean(weight_ma1)

weight_arr=np.array(weight)
weight_mas=np.ma.masked_where(weight_arr<70,weight_arr)
weight_mas
weight_mas.data
weight_mas.mask

np.ma.masked_array(weight_arr, mask=aa)