More NumPy
Additional features of NumPy
- Linear algebra
- Random sampling
- Vectorized functions
Linear Algebra
NumPy provides a module for linear algebra called linalg
This is a sub-module of NumPy
- Attributes can be accessed via
np.linalg.attribute_name
- An interface to the well-known LAPACK Fortran library
The module provides functions for computing determinants, eigenvalues, etc.
>>> A = np.array([[1, 2], [3, 4]])
>>> np.linalg.det(A) # Compute the determinant
-2.0
>>> np.linalg.inv(A) # Compute the inverse
array([[-2. , 1. ],
[ 1.5, -0.5]])
>>> np.linalg.eigvals(A) # Compute eigenvalues
array([-0.37228132+0.j, 5.37228132+0.j])
To solve the linear system A x = b for x we can use
solve()
>>> A
array([[1, 2],
[3, 4]])
>>> b = (5, 6)
>>> np.linalg.solve(A, b)
array([-4. , 4.5])
Another alternative is to invert A
>>> np.dot(np.linalg.inv(A), b)
array([-4. , 4.5])
But
solve()
is usually more efficientRandom Sampling
NumPy has a sub-module called random
Populates
ndarrays
with samples of various distributions
For example, to generate 3 draws from the beta distribution
we use
random.beta
>>> a = 0.5
>>> b = 0.5
>>> np.random.beta(a, b, size=3)
array([ 0.23607443, 0.90941746, 0.49067643])
Binomial distribution:
>>> y = np.random.binomial(10, 0.5, size=1000) # 1,000 observations of Bin(10, 0.5)
>>> y.mean()
5.1109999999999998
>>> y = np.random.binomial(10, 0.8, size=1000)
>>> y.mean()
7.9909999999999997
The normal distribution can be accessed via
np.random.normal
, but there is also the convenience function randn
, which gives standard normals>>> Z = np.random.randn(10000)
>>> Z = 5 * Z + 3 # Each element is now N(3, 25)
>>> Z.mean()
2.997125110268561
>>> Z.var()
24.970266678202606
All the other standard distributions can be found in
np.random
- To get the list, type
dir(np.random)
- Then use the docstrings
Other Goodies
NumPy provides some vectorized functions, which act elementwise on arrays
>>> Z = np.linspace(2, 4, 4)
>>> np.log(Z)
array([ 0.69314718, 0.98082925, 1.2039728 , 1.38629436])
>>> np.exp(Z)
array([ 7.3890561 , 14.3919161 , 28.03162489, 54.59815003])
>>> np.cos(Z)
array([-0.41614684, -0.88932657, -0.981674 , -0.65364362])
>>> np.sin(Z)
array([ 0.90929743, 0.45727263, -0.19056796, -0.7568025 ])
The advantage is that we avoid looping
Note that not all Python functions will be vectorized
>>> f = lambda x: -1 if x < 0 else 1
>>> f(-2)
-1
>>> f(3)
1
>>> Z = np.array([-2, 0, 2])
>>> f(Z)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <lambda>
ValueError: The truth value...
We can vectorize
f
using np.vectorize()
>>> f = np.vectorize(f)
>>> f(Z)
array([-1, 1, 1])
In addition
np.interp
provides linear interpolation in 1 dimnp.poly1d
is a class for manipulating polynomialsnp.histogram
gives histogram information
See docstrings for details
A Note on Pylab and Matplotlib
A quick aside on the relationship between Pylab and Matplotlib
One way to do plots with Matplotlib is like this
import pylab
pylab.plot([1, 2, 3])
pylab.show()
The same can be achieved by
import matplotlib.pyplot as plt
plt.plot([1, 2, 3])
plt.show()
What is the difference?
We can see the difference in the pylab initialization file:
## some stuff
from numpy import *
from numpy.fft import *
from numpy.random import *
from numpy.linalg import *
## some more stuff
from matplotlib.pyplot import *
## some more stuff
Thus,
import pylab
brings in- everything from the NumPy namespace
- everything from various NumPy submodules (
random
,linalg
, etc.) - everything from
matplotlib.pyplot
The plotting functions are in
matplotlib.pyplot
Exercises
Here is a common problem in simulation
Let's say we have an array
q
which represents a probability mass function- Array of nonnegative floats with
q.sum()
equal to one - Let
len(q)
be equal toN
We want to generate a random variable X such that
X
takes values in 0,...,N-1- The probability that
X = i
is equal toq[i]
The standard (inverse transform) algorithm is as follows
- Divide unit interval [0, 1] into N subintervals I
0 , I1,...,IN-1, such that the length of Ii isq[i]
- Draw a uniform r.v.
U
on [0, 1] and - Return the
i
such thatU
is in Ii
The probability of drawing i , which is equal to
i
is the length of Iq[i]
EDTC provides some code for this operation
from random import uniform
def sample(q):
"""
Returns i with probability q[i], where q is an array
(e.g., list or tuple).
"""
a = 0.0
U = uniform(0,1)
N = len(q)
for i in range(N):
if a < U <= a + q[i]:
return i
a = a + q[i]
Can you see how it works?
- If not, try thinking through the flow for a simple example
- e.g., set
q = [0.25, 0.75]
- Consider different values of
U
- It helps to sketch the intervals on paper
- e.g., set
Problem 1
Make the following two improvements to this code:
- Speed it up using NumPy, avoiding
for
andwhile
loops- Hint: Use searchsorted() and cumsum()
- Implement as a class, where each instance stores its own value of
q
- Give the class a
draw()
method, which returns one draw from{0,...,len(q)-1}
according to q- Or even better, write the method so that
draw(n)
returns n draws from q
- Or even better, write the method so that
- Give the class a
Solution:
## Filename: discreterv.py
## Author: John Stachurski
from numpy import cumsum
from numpy.random import uniform
class DiscreteRV:
"""
Each instance is provided with an array of probabilities q.
The draw() method returns x with probability q[x].
"""
def __init__(self, q):
self.set_q(q)
def set_q(self, q):
self.Q = cumsum(q) # Cumulative sum
def draw(self, n=1):
"""
Returns n draws from q
"""
return self.Q.searchsorted(uniform(0, 1, size=n))
A bit tricky, but with some thought you will be able to understand it
Problem 2
In a previous lecture and EDTC, section 6.1.2, I discuss the empirical (cumulative) distribution function
Thus Fn(x) is the fraction of the sample that falls below x.
This function approximates the true distribution function F
- Converges to F(x) with probability one as n goes to infinity for all x
The ECDF is implemented as follows:
## Filename: ecdf.py
## Author: John Stachurski
class ECDF:
def __init__(self, observations):
self.observations = observations
def __call__(self, x):
counter = 0.0
for obs in self.observations:
if obs <= x:
counter += 1
return counter / len(self.observations)
Here's an example of usage:
from ecdf import ECDF
samples = np.random.uniform(0, 1, size=10)
F = ECDF(samples)
F(0.5) # Returned 0.29
F.observations = np.random.uniform(0, 1, size=1000)
F(0.5) # Returned 0.479
Our implementation is not very efficient because it uses a
for
loop
Problem:
- Make the
__call__
method more efficient using NumPy. - Add a method which plots the ECDF over [a, b]
- Here a and b are method parameters
Solution:
## Filename: ecdf2.py
## Author: John Stachurski
import numpy as np
import matplotlib.pyplot as plt
class ECDF:
def __init__(self, observations):
"Parameters: observations is a NumPy array."
self.observations = observations
def __call__(self, x):
try:
return (self.observations <= x).mean()
except AttributeError:
# self.observations probably needs to be converted to
# a NumPy array
self.observations = np.array(self.observations)
return (self.observations <= x).mean()
def plot(self, a=None, b=None):
if a == None:
# Set up a reasonable default
a = self.observations.min() - self.observations.std()
if b == None:
# Set up a reasonable default
b = self.observations.max() + self.observations.std()
X = np.linspace(a, b, num=100)
f = np.vectorize(self.__call__)
plt.plot(X, f(X))
plt.show()
0 comments:
Post a Comment