Application: Finite State Optimal Growth
This lecture:
- Formulate a simple, finite state optimal growth problem
- Solve by value iteration
The lecture is based on section 5.1 of EDTC
- Please refer to that section for detailed explanations
- You need some familiarity with dynamic programming to read this lecture
- And a bit more mathematics than the shortest path application
Outline of the Problem
Model the behavior of Colonel Kurtz, who can be found on a small island in the Nung River, living on catfish
Catfish bite at dawn, and the Colonel bags a random quantity W
- A random variable on {0,...,B} with distribution q
Catfish last only if refrigerated, and the Colonel's freezer holds up to M fish
Let Xt be the state variable: stock of fish at noon on day t
In the evening, the Colonel consumes quantity Ct, and freezes Rt = Xt - Ct
Following the next morning's fishing trip, the state is
X_{t+1} = R_t + W_{t+1}
Assume that Kurtz follows policy function g: on observing state X, saves R = g(X)
State space (possible values for X) is S = {0,...,M+B}
Policy function g maps S into {0,...,M}, and satisfies g(x) <= x for all x in S (the Colonel cannot freeze more fish than he currently has)
Denote the set of all feasible policies by G. For given g in G, the state evolves according to
X_{t+1} = g(X_t) + W_{t+1}
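To make the dynamics concrete, here is a minimal illustrative sketch (not part of EDTC or the exercises below) that simulates a state path; the policy g used here, which freezes as many fish as possible, is just a placeholder

import random

B, M = 10, 5                          # maximum catch and freezer capacity

def g(x):
    "Placeholder policy: freeze as many fish as possible."
    return min(x, M)

def simulate(x0, T=10):
    "Simulate X_0,...,X_T from X_0 = x0, with W uniform on {0,...,B}."
    path = [x0]
    for t in range(T):
        W = random.randint(0, B)      # the next morning's catch
        path.append(g(path[-1]) + W)  # X_{t+1} = g(X_t) + W_{t+1}
    return path

print(simulate(0))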
Letting U be the utility function and rho in (0,1) the discount factor, the optimization problem of Colonel Kurtz is to choose a feasible policy that maximizes
E[ sum_{t>=0} rho^t U(C_t) ]
Given that we are currently at state x, the reward from following policy g is
v_g(x) = E[ sum_{t>=0} rho^t U(X_t - g(X_t)) ],   where X_0 = x and X_{t+1} = g(X_t) + W_{t+1}
The value function is defined as
v*(x) = max_{g in G} v_g(x)   (x in S)
The precise definition of optimality is as follows: a policy g* in G is called optimal if v_{g*}(x) = v*(x) for all x in S
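As an aside, v_g(x) can be approximated by Monte Carlo simulation. The sketch below is illustrative only; the helper name value_of_policy, the truncation length and the sample size are arbitrary choices, and the policy passed in is again a placeholder

import random

beta, rho, B, M = 0.5, 0.9, 10, 5     # same parameter values as the exercises below

def U(c):
    "Utility function."
    return c**beta

def value_of_policy(g, x, T=60, n=1000):
    "Illustrative Monte Carlo estimate of v_g(x), truncating the sum at T periods."
    total = 0.0
    for i in range(n):
        X = x
        for t in range(T):
            total += rho**t * U(X - g(X))       # consumption is C_t = X_t - g(X_t)
            X = g(X) + random.randint(0, B)     # X_{t+1} = g(X_t) + W_{t+1}
    return total / n

print(value_of_policy(lambda x: min(x, M), 0))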
The Bellman Equation
The set of all feasible actions (the number of fish that can be frozen) when the current state is x is
Gamma(x) = {0, 1, ..., min(x, M)}
The Bellman equation can be written as
v*(x) = max_{a in Gamma(x)} { U(x - a) + rho * sum_{z=0}^{B} v*(a + z) q(z) }   (x in S)
The idea is as follows
Suppose we know the values of different states in terms of maximum future rewards
Then the best action is found by trading off the two effects inherent in choosing an action:
- Current reward, and
- Future reward after transitioning to new state next period
- New state depends on current action
Making this trade-off optimally gives the maximum value from the current state
Given a real-valued function w on S, we call g in G w-greedy if, for every x in S, g(x) is a maximizer of
U(x - a) + rho * sum_{z=0}^{B} w(a + z) q(z)   over a in Gamma(x)
It is known that a policy is optimal if and only if it is v*-greedy
Hence computing an optimal policy is trivial if we know the value function
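For illustration, here is a sketch of how a w-greedy action at a single state can be computed by enumerating Gamma(x); the function name greedy_action and the initial guess w are arbitrary choices, not part of the exercises below

beta, rho, B, M = 0.5, 0.9, 10, 5
S = range(B + M + 1)

def U(c):
    "Utility function."
    return c**beta

def greedy_action(w, x):
    "Illustrative helper: an action in Gamma(x) maximizing current reward plus discounted expected future value."
    q = 1.0 / (B + 1)                  # uniform shock distribution on {0,...,B}
    def rhs(a):
        return U(x - a) + rho * sum(w[a + z] * q for z in range(B + 1))
    return max(range(min(x, M) + 1), key=rhs)

w = [U(x) for x in S]                  # an arbitrary guess of the value function
print(greedy_action(w, 8))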
How to solve for the value function?
Let bS be the set of real-valued functions on S
Define the Bellman operator T sending bS into bS by
Tv(x) = max_{a in Gamma(x)} { U(x - a) + rho * sum_{z=0}^{B} v(a + z) q(z) }   (x in S)
Facts (see the text):
- T is a uniform contraction on bS with respect to the distance d_infty(v, w) = max_{x in S} |v(x) - w(x)|
- v* is the unique fixed point of T in bS
In consequence, iteration of T on any v in bS converges to v*
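These facts can be checked numerically. The sketch below restates the Bellman operator in compact form (so it runs on its own) and verifies the contraction inequality d_infty(Tv, Tw) <= rho * d_infty(v, w) for two randomly chosen functions; it is a rough check, not a proof

import random

beta, rho, B, M = 0.5, 0.9, 10, 5
S = range(B + M + 1)
prob = 1.0 / (B + 1)                   # uniform shock probabilities

def U(c):
    return c**beta

def T(v):
    "Compact restatement of the Bellman operator, acting on a sequence v indexed by the states in S."
    return [max(U(x - a) + rho * sum(v[a + z] * prob for z in range(B + 1))
                for a in range(min(x, M) + 1))
            for x in S]

def d_infty(v, w):
    return max(abs(v[x] - w[x]) for x in S)

v = [random.random() for x in S]
w = [random.random() for x in S]
print(d_infty(T(v), T(w)) <= rho * d_infty(v, w))   # should print True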
Value iteration algorithm
Iterate with T on some initial v in bS, producing the sequence (T^n v)
Continue iterating until d_infty(T^{n+1} v, T^n v) falls below some small tolerance
Compute a w-greedy policy g, where w is the final iterate
If the tolerance is small, then g is almost optimal (see the text)
Exercises
Solve for the optimal policy
- Implement the Bellman operator T as a Python function that takes a sequence v as an argument and returns another sequence representing Tv
- Elements of bS can be represented as lists or tuples, where v[0] corresponds to v(0), etc.
- Write a function greedy() to compute greedy policies
- Implement the loop as a function main()
- Use 0.001 for the tolerance
For the utility function set U(c) = c^beta
For the parameters, use
beta, rho, B, M = 0.5, 0.9, 10, 5
Let q be uniform on {0,...,B}
Plot the sequence of functions created in the value iteration algorithm
Print the greedy policy (as a list) calculated from the last iterate
The solution is
>>> g
[0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 5, 5]
Solution
import pylab

beta, rho, B, M = 0.5, 0.9, 10, 5
S = range(B + M + 1)  # State space
Z = range(B + 1)      # Shock space

def d_infty(v, w):
    "Sup distance between two functions on S."
    return max(abs(v[x] - w[x]) for x in S)

def U(c):
    "Utility function."
    return c**beta

def q(z):
    "Probability mass function, uniform distribution."
    return 1.0 / len(Z) if 0 <= z <= B else 0

def Gamma(x):
    "The correspondence of feasible actions."
    return range(min(x, M) + 1)

def T(v):
    """An implementation of the Bellman operator.
    Parameters: v is a sequence representing a function defined on S.
    Returns: Tv, a list."""
    Tv = []
    for x in S:
        vals = []  # Stores rewards for a in Gamma(x)
        for a in Gamma(x):
            y = U(x - a) + rho * sum(v[a + z] * q(z) for z in Z)
            vals.append(y)
        Tv.append(max(vals))
    return Tv

def greedy(v):
    "Computes a v-greedy policy, returned as a list."
    g = []
    for x in S:
        runningmax = -1
        for a in Gamma(x):
            y = U(x - a) + rho * sum(v[a + z] * q(z) for z in Z)
            if y > runningmax:
                runningmax = y
                maximizer = a
        g.append(maximizer)
    return g

def main():
    current = [U(x) for x in S]  # Initial condition for value iteration
    tolerance = 0.001
    while True:
        pylab.plot(current, 'bo-')
        new = T(current)
        if d_infty(new, current) < tolerance:
            break
        current = new
    g = greedy(new)
    print('The optimal policy is:')
    print(g)
    pylab.show()

main()