Value Function Computation in Markov Reward Processes
The value function of a Markov reward process can be computed using different methods. One method is to use the Bellman equation, a recursive equation that relates the value of a state to the values of its successor states. The Bellman equation can be solved iteratively using value iteration, an algorithm that repeatedly updates an estimate of the value function until it converges to the true value function. Another method is to use linear algebra, which is possible when the Markov reward process is finite, meaning that it has a finite number of states and rewards. In this case, the Bellman equation becomes a system of linear equations in the state values, which can be solved analytically using matrix inversion.
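In matrix form, with V the vector of state values, R the vector of rewards, and P the transition probability matrix, the Bellman equation reads:

V = R + γPV

Rearranging gives (I - γP)V = R, which is the linear system solved by matrix inversion; value iteration instead repeats the update V ← R + γPV until successive estimates agree.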
The value function of a Markov reward process has some properties that are useful to know. One property is monotonicity with respect to the rewards: if we increase the reward for any state, the value function for that state and all other states will increase or stay the same. Another property is boundedness: as long as the rewards are bounded and the discount factor is strictly less than one, the value function is bounded above and below by finite numbers and cannot diverge to infinity or negative infinity. The bounds on the value function depend on the discount factor and the range of rewards. A third property is that the value function is determined entirely by the transition probabilities, rewards, and discount factor of the Markov reward process, not by how the process happens to be started: a value is defined for every state, including any terminal states, which are simply part of the process.
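Concretely, if every reward lies between R_min and R_max and the discount factor γ is strictly less than 1, then for every state s:

R_min / (1 - γ) ≤ v(s) ≤ R_max / (1 - γ)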
Value iteration and linear algebra are two methods for computing the value function of a Markov reward process. The main points of the comparison are:
Value iteration is an iterative method that applies the Bellman equation until convergence, while linear algebra is an analytical method that solves a system of linear equations using matrix inversion.
Value iteration can handle any Markov reward process, but it only approximates the value function and may need many iterations to converge. Linear algebra only applies to finite Markov reward processes, but for problems of modest size it is fast and exact, solving for the value function in a single step.
The choice between value iteration and linear algebra depends on the size and structure of the Markov reward process, the accuracy and speed requirements, and the computational resources available.
The value function of a Markov reward process depends on its transition probabilities and rewards. Here are some examples of value functions for different Markov reward processes:
A Markov reward process with a single state and a constant reward. In this case, the value function is equal to the reward divided by one minus the discount factor. For example, if the reward is 1 and the discount factor is 0.9, then the value function is 10.
The value function of a Markov reward process is the expected total discounted reward that can be obtained starting from a given state. For a Markov reward process with a single state and a constant reward, the value function equals the reward divided by one minus the discount factor because it is the unique solution of the Bellman equation. The Bellman equation is a recursive equation that relates the value function of a state to the value functions of its successor states. In this case there is only one state, so the Bellman equation reduces to:

v(s) = r + γ v(s)

where v(s) is the value function of the single state s, r is the constant reward, and γ is the discount factor. Solving for v(s) gives:

v(s) = r / (1 - γ)

This expression is also the sum of the geometric series

v(s) = r + γr + γ²r + …

which converges whenever γ < 1. The series represents the total discounted reward obtained by starting in state s and staying in it forever.
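As a quick numerical sanity check of this closed form, here is a small Python sketch using the reward and discount factor from the example:

```python
# Single-state MRP: check numerically that the discounted return
# r + gamma*r + gamma^2*r + ... approaches r / (1 - gamma).
r, gamma = 1.0, 0.9                 # reward and discount factor from the example

partial_sum = sum((gamma ** k) * r for k in range(1000))

print(partial_sum)                  # ≈ 10.0
print(r / (1 - gamma))              # ≈ 10.0 (closed form)
```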
A Markov reward process with two states and deterministic transitions. For example, if the rewards are 1 and 2, the discount factor is 0.9, and each state transitions to the other.
In a Markov Reward Process (MRP) with two states and deterministic transitions, the value function for each state can be calculated using the Bellman equation:

V(S) = R(S) + γ Σ_{S'} P(S'|S) V(S')

where:

V(S) is the value function of state S,
R(S) is the immediate reward for state S,
γ is the discount factor,
P(S'|S) is the transition probability from state S to S',
V(S') is the value of the next state S'.
For deterministic transitions, the transition probability to the single successor state is 1, so the Bellman equation simplifies to:

V(S) = R(S) + γ V(S')

where S' is the successor of S. With S1 and S2 each transitioning to the other, this gives:

V(S1) = R(S1) + γ V(S2)
V(S2) = R(S2) + γ V(S1)

This forms a system of linear equations, which can be solved to find the value functions V(S1) and V(S2).
While the value function does take into account the rewards of all future states, in this specific case with deterministic transitions between two states, it's sufficient to consider just the immediate successor state in the calculation. The system of equations ensures that the value function for each state includes the discounted future rewards from all possible future states.
2nd approach:
For an MRP with two states and deterministic transitions, the value function can also be calculated using the infinite-horizon approach. The value function of a state is the expected total discounted reward starting from that state. With each state transitioning to the other, this can be represented as an infinite series:

V(S1) = R(S1) + γ R(S2) + γ² R(S1) + γ³ R(S2) + …
V(S2) = R(S2) + γ R(S1) + γ² R(S2) + γ³ R(S1) + …

These series can be summed using the formula for the sum of an infinite geometric series (with common ratio γ²), giving the value functions:

V(S1) = (R(S1) + γ R(S2)) / (1 - γ²)
V(S2) = (R(S2) + γ R(S1)) / (1 - γ²)

These equations represent the expected total discounted reward from each state, taking into account the immediate reward and the discounted future rewards from the other state.
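A small NumPy sketch of this two-state example, using the alternating transition structure described above (S1 → S2 and S2 → S1), compares the direct solution of the linear system with the geometric-series closed form:

```python
import numpy as np

# Two-state MRP: rewards 1 and 2, discount factor 0.9,
# deterministic transitions S1 -> S2 and S2 -> S1.
R = np.array([1.0, 2.0])
gamma = 0.9
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Solve the Bellman system (I - gamma * P) V = R directly.
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(np.round(V, 2))                          # ≈ [14.74 15.26]

# Geometric-series closed form for the alternating chain.
V1 = (R[0] + gamma * R[1]) / (1 - gamma**2)
V2 = (R[1] + gamma * R[0]) / (1 - gamma**2)
print(round(V1, 2), round(V2, 2))              # ≈ 14.74 15.26
```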
A Markov reward process with three states and stochastic transitions. In this case, the value function can be computed using the Bellman equation or linear algebra. For example, if the rewards are 1, 2, and 3, the discount factor is 0.9, and the transition probabilities are given by the following matrix:
|    | S1  | S2  | S3  |
|----|-----|-----|-----|
| S1 | 0.5 | 0.5 | 0   |
| S2 | 0   | 0.5 | 0.5 |
| S3 | 0.5 | 0   | 0.5 |

(each row gives the probabilities of moving from that state to S1, S2, and S3)
Here's the step-by-step solution path:
Define the problem: We have a Markov Reward Process (MRP) with three states and stochastic transitions. The rewards for the three states are 1, 2, and 3, the discount factor is 0.9, and the transition probabilities are given by the matrix above.
Understand the value function: The value function of an MRP is a measure of the expected cumulative discounted reward from being in a particular state, taking into account the rewards of all future states that could be visited. It is given by the Bellman equation:

V(S) = R(S) + γ Σ_{S'} P(S'|S) V(S')

with V(S), R(S), γ, P(S'|S), and V(S') defined as above.
Apply the Bellman equation to our problem: For stochastic transitions, the transition probabilities are given by the matrix P. Therefore, the value function for each state can be calculated as follows:

V(S1) = 1 + 0.9 × (0.5 V(S1) + 0.5 V(S2))
V(S2) = 2 + 0.9 × (0.5 V(S2) + 0.5 V(S3))
V(S3) = 3 + 0.9 × (0.5 V(S1) + 0.5 V(S3))

This forms a system of linear equations.
Solve the system of equations: We can solve this system of equations to find the values of V(S1), V(S2), and V(S3). This can be done using linear algebra, specifically the LinearSolve function in Wolfram Language. The system of equations can be represented in matrix form as (I - γP)V = R, where I is the identity matrix, P is the transition probability matrix, V is the vector of value functions, and R is the vector of rewards. Solving for V gives the value functions for the three states.
Using the given rewards, discount factor, and transition probabilities, we find:
V(S1) ≈ 18.67
V(S2) ≈ 20.60
V(S3) ≈ 20.73
These values represent the expected cumulative discounted reward from being in each state, taking into account the rewards of all future states that could be visited, given the stochastic transitions.
This problem can also be solved directly in Python using the NumPy library, which provides functions for working with arrays and matrices. Here are two different schemes:
Iterative method using the Bellman equation
This method involves initializing the value function and then iteratively updating it using the Bellman equation until it converges.
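A minimal NumPy sketch of this scheme for the three-state example above (the convergence tolerance and iteration cap are illustrative choices):

```python
import numpy as np

# Three-state MRP from the example above.
R = np.array([1.0, 2.0, 3.0])                 # rewards for S1, S2, S3
gamma = 0.9                                   # discount factor
P = np.array([[0.5, 0.5, 0.0],                # row = current state,
              [0.0, 0.5, 0.5],                # column = next state
              [0.5, 0.0, 0.5]])

# Value iteration: repeatedly apply the Bellman update V <- R + gamma * P @ V
# until successive estimates differ by less than a small tolerance.
V = np.zeros(3)                               # arbitrary initial guess
for _ in range(10_000):                       # iteration cap (illustrative)
    V_new = R + gamma * P @ V
    if np.max(np.abs(V_new - V)) < 1e-10:     # convergence tolerance (illustrative)
        V = V_new
        break
    V = V_new

print(np.round(V, 2))                         # ≈ [18.67 20.6  20.73]
```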
Direct method using linear algebra
This method involves solving the system of equations (I - γP)V = R directly using the numpy.linalg.solve function.
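A corresponding sketch of the direct scheme, reusing the same rewards, discount factor, and transition matrix:

```python
import numpy as np

# Same three-state MRP as above.
R = np.array([1.0, 2.0, 3.0])
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])

# Direct method: solve the linear system (I - gamma * P) V = R in one step.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(np.round(V, 2))                         # ≈ [18.67 20.6  20.73]
```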
Both of these methods will give you the same result. The iterative method may be more intuitive as it directly implements the Bellman equation, but the direct method is more efficient as it solves the system of equations in one step.