Navigating State Transitions and Rewards: PR, P, RT, and R in MRP
In the realm of reinforcement learning, understanding the dynamics of transitioning between states and the rewards associated with these transitions is critical. Four key players help us navigate this landscape: PR, P, RT, and R.
PR(s, r, s') serves as our comprehensive guide, providing the joint probability of transitioning to a new state s' and receiving a reward r when the current state is s. P(s, s'), on the other hand, is like a skilled scout focusing on the terrain, giving us the likelihood of moving from state s to state s', irrespective of the associated reward.
Now, when it comes to reaping potential rewards from a state transition, RT(s, s') comes into play. As the reward transition function, RT(s, s') gives the expected reward when transitioning from state s to state s'. It does this by taking a weighted sum of all possible rewards (with the weights given by PR(s, r, s')) and normalizing this sum by the total transition probability P(s, s'). The normalization turns the joint probabilities into probabilities conditioned on the transition to s' actually occurring, so the expected reward is neither inflated nor deflated by how likely the transition itself is.
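Written out, the description above reads:

```latex
P(s, s') = \sum_{r} PR(s, r, s'),
\qquad
RT(s, s') = \frac{\sum_{r} r \cdot PR(s, r, s')}{P(s, s')}
```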
Finally, we have R(s), which stands as the grand strategist. Unlike RT, which focuses on the reward for each individual state transition, R(s) gives an overall perspective. It calculates the expected reward for being in the current state s by looking at all potential next states and their associated expected rewards. Concretely, it sums, over all next states s', the product of the transition probability P(s, s') and the expected reward for that transition RT(s, s').
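In symbols, and noting that the P(s, s') factor cancels against the normalization inside RT(s, s'):

```latex
R(s) = \sum_{s'} P(s, s') \cdot RT(s, s')
     = \sum_{s'} \sum_{r} r \cdot PR(s, r, s')
```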
These four functions (PR, P, RT, R) are all related to the concepts of joint and conditional probabilities:
PR(s, r, s') represents the joint probability of transitioning to a new state s' and receiving a reward r given the current state s.
P(s, s') is the probability of transitioning from state s to state s', regardless of the reward. This is a conditional probability where the next state depends on the current state. It is also the transition probability function of the implicit Markov Process, obtained by summing PR(s, r, s') over all rewards r: P(s, s') = Σ_r PR(s, r, s').
RT(s, s') is the expected reward when transitioning from state s to state s', calculated as a weighted sum of all possible rewards, with the weights given by their respective probabilities (from PR(s, r, s')) and normalized by P(s, s'). This is a conditional expectation: the expected reward is conditioned on both the current state s and the next state s'.
R(s) gives the expected reward for being in the current state s by looking at all potential next states and their associated expected rewards. This is also a conditional expectation, conditioned only on the current state s.
Together, PR, P, RT, and R provide a comprehensive framework for understanding the dynamics of state transitions and rewards, critical for determining optimal policies in reinforcement learning.
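To make the relationships concrete, here is a minimal sketch in Python. The two-state PR table, the state names, and the reward values are purely illustrative assumptions; the derivations of P, RT, and R follow the formulas above.

```python
from collections import defaultdict

# Hypothetical tabular PR for a two-state MRP:
# PR[(s, r, s_next)] = joint probability of next state s_next and reward r, given state s.
PR = {
    ("A", 1.0, "A"): 0.2,
    ("A", 0.0, "B"): 0.5,
    ("A", 2.0, "B"): 0.3,
    ("B", 1.0, "A"): 0.6,
    ("B", 3.0, "B"): 0.4,
}

# P(s, s') = sum over rewards r of PR(s, r, s')
P = defaultdict(float)
for (s, r, s_next), prob in PR.items():
    P[(s, s_next)] += prob

# RT(s, s') = expected reward of the transition s -> s'
#           = sum_r r * PR(s, r, s') / P(s, s')
RT = defaultdict(float)
for (s, r, s_next), prob in PR.items():
    RT[(s, s_next)] += r * prob
for key in RT:
    RT[key] /= P[key]

# R(s) = sum over next states s' of P(s, s') * RT(s, s')
R = defaultdict(float)
for (s, s_next), prob in P.items():
    R[s] += prob * RT[(s, s_next)]

print(dict(P))   # e.g. P[("A", "B")] == 0.8
print(dict(RT))  # e.g. RT[("A", "B")] == (0.0*0.5 + 2.0*0.3) / 0.8 == 0.75
print(dict(R))   # e.g. R["A"] == 0.2*1.0 + 0.8*0.75 == 0.8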