State-Value and Action-Value Functions

A Markov Process is a framework for modeling sequential uncertainty: from any state, the next state is unpredictable. By attaching an uncertain reward to each transition, we turn it into a Markov Reward Process (MRP). In both cases, the movement from the current state to the next is driven purely by the system's inherent dynamics; there is no predefined policy choosing actions.

Before, within a given model, we had no control over the transition from the current state to the next (although we could choose among different models). Now, by sampling actions from a policy, we gain partial control over the outcome (the current state still shapes what happens, of course). In other words, what a Markov Decision Process (MDP) adds is the ability to choose actions within the model. Given a policy, we make sequential decisions in response to the current state; each such decision is an action, denoted "a".

A deterministic policy is a mapping that assigns exactly one action to each state: given the current state, the action is certain and no randomness is involved. A stochastic policy assigns a probability distribution over actions to each state: the action is then sampled from that distribution, so there is randomness involved. For example, the action could be drawn from a Poisson distribution or any other distribution over the action space.
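To make the distinction concrete, here is a minimal Python sketch (not the book's actual interfaces; the class names, string-valued states and actions, and the market labels are purely illustrative): a deterministic policy returns the same action for a given state every time, while a stochastic policy samples from a per-state distribution.

```python
import random
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class DeterministicPolicy:
    # Each state maps to exactly one action: no randomness involved.
    rule: Dict[str, str]

    def act(self, state: str) -> str:
        return self.rule[state]

@dataclass(frozen=True)
class StochasticPolicy:
    # Each state maps to a probability distribution over actions.
    rule: Dict[str, Dict[str, float]]

    def act(self, state: str) -> str:
        dist = self.rule[state]
        actions, probs = zip(*dist.items())
        return random.choices(actions, weights=probs, k=1)[0]

deterministic = DeterministicPolicy(rule={"Bull": "Buy", "Bear": "Sell"})
stochastic = StochasticPolicy(rule={"Bull": {"Buy": 0.8, "Hold": 0.2},
                                    "Bear": {"Sell": 0.7, "Hold": 0.3}})
print(deterministic.act("Bull"))   # always "Buy"
print(stochastic.act("Bull"))      # "Buy" ~80% of the time, "Hold" ~20%
```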

An MDP and an MRP are intimately related: applying a fixed policy to an MDP effectively transforms it into an MRP, capturing the expected dynamics and rewards of the system when actions are selected according to that policy.
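As a rough illustration of that collapse, the sketch below averages an MDP's transition probabilities and rewards over a fixed policy to obtain the implied MRP quantities P_π(s'|s) = Σ_a π(a|s)·P(s'|s,a) and R_π(s) = Σ_a π(a|s)·R(s,a). The dictionary-based model and all the numbers are hypothetical, chosen only to keep the example small.

```python
from typing import Dict

# Hypothetical tabular MDP: P[s][a][s'] is a transition probability,
# R[s][a] is the expected reward for taking action a in state s.
mdp_P: Dict[str, Dict[str, Dict[str, float]]] = {
    "Bull": {"Buy": {"Bull": 0.7, "Bear": 0.3}, "Sell": {"Bull": 0.4, "Bear": 0.6}},
    "Bear": {"Buy": {"Bull": 0.5, "Bear": 0.5}, "Sell": {"Bull": 0.2, "Bear": 0.8}},
}
mdp_R: Dict[str, Dict[str, float]] = {
    "Bull": {"Buy": 1.0, "Sell": 0.5},
    "Bear": {"Buy": -0.5, "Sell": 0.2},
}
policy: Dict[str, Dict[str, float]] = {        # pi(a|s), also hypothetical
    "Bull": {"Buy": 0.9, "Sell": 0.1},
    "Bear": {"Buy": 0.2, "Sell": 0.8},
}

# Collapse the MDP under the fixed policy into MRP dynamics:
#   P_pi(s'|s) = sum_a pi(a|s) * P(s'|s,a)
#   R_pi(s)    = sum_a pi(a|s) * R(s,a)
mrp_P: Dict[str, Dict[str, float]] = {}
mrp_R: Dict[str, float] = {}
for s in mdp_P:
    mrp_P[s] = {}
    mrp_R[s] = sum(policy[s][a] * mdp_R[s][a] for a in policy[s])
    for a, dist in mdp_P[s].items():
        for s_next, p in dist.items():
            mrp_P[s][s_next] = mrp_P[s].get(s_next, 0.0) + policy[s][a] * p

print(mrp_P["Bull"])   # implied transition probabilities under the policy
print(mrp_R["Bull"])   # implied expected reward under the policy
```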

Now, let's dive into the grand spectacle of reinforcement learning, where two stars illuminate our understanding: the State-Value function, V(s), and the Action-Value function, Q(s,a).

V(s) = E[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s]

The State-Value function, V(s), is a nod towards the future's unpredictability. It's akin to standing at a crossroads, contemplating the myriad paths that unfold before us. Each path begins with a possible action, and V(s) averages the expected rewards of all these paths, weighting each action by how likely the policy is to choose it. It's a calculated prediction of the future, informed by the current state, but not committed to any single course of action.

Q(s, a) = E[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s, A_t = a]

In contrast, the Action-Value function, Q(s,a), embodies the courage to commit to a specific action. It measures the expected return when starting from a state, taking a particular action, and following the policy thereafter, telling us what might happen if we dare to step down a specific path. Despite this commitment, Q(s,a) also grapples with the unknown. It factors in the possible futures that unfold after the action, influenced by both the agent's choice and the whims of a stochastic environment.
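Both definitions are expectations over discounted returns, so one direct (if inefficient) way to ground them is Monte Carlo: simulate many trajectories under the policy and average the sampled returns, forcing the first action when estimating Q(s,a). The toy two-state market model below is hypothetical and the truncation horizon is arbitrary; it is only meant to make the two expectations tangible.

```python
import random

# Hypothetical two-state, two-action market toy model (made-up numbers).
P = {"Bull": {"Buy": [("Bull", 0.7), ("Bear", 0.3)], "Sell": [("Bull", 0.4), ("Bear", 0.6)]},
     "Bear": {"Buy": [("Bull", 0.5), ("Bear", 0.5)], "Sell": [("Bull", 0.2), ("Bear", 0.8)]}}
R = {"Bull": {"Buy": 1.0, "Sell": 0.5}, "Bear": {"Buy": -0.5, "Sell": 0.2}}
PI = {"Bull": {"Buy": 0.9, "Sell": 0.1}, "Bear": {"Buy": 0.2, "Sell": 0.8}}
GAMMA = 0.9

def sample_action(state):
    actions, probs = zip(*PI[state].items())
    return random.choices(actions, weights=probs)[0]

def sample_next_state(state, action):
    states, probs = zip(*P[state][action])
    return random.choices(states, weights=probs)[0]

def sampled_return(state, first_action=None, horizon=50):
    """One sampled discounted return, optionally forcing the first action."""
    g, discount = 0.0, 1.0
    action = first_action or sample_action(state)
    for _ in range(horizon):            # truncate the infinite sum
        g += discount * R[state][action]
        discount *= GAMMA
        state = sample_next_state(state, action)
        action = sample_action(state)
    return g

# Monte Carlo estimates of the two value functions for the "Bull" state:
v_bull = sum(sampled_return("Bull") for _ in range(5000)) / 5000
q_bull_sell = sum(sampled_return("Bull", "Sell") for _ in range(5000)) / 5000
print(f"V_pi(Bull)       ~ {v_bull:.3f}")
print(f"Q_pi(Bull, Sell) ~ {q_bull_sell:.3f}")
```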

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s)\,\big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\,V^{\pi}(s')\big]$$

$$Q^{\pi}(s, a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\,V^{\pi}(s')$$
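The first identity is linear in V^π once the policy is fixed, so for a small tabular problem it can be solved exactly; the second then yields Q^π with no further iteration. Below is a minimal NumPy sketch, reusing the same hypothetical toy model (all numbers are made up for illustration).

```python
import numpy as np

# Same hypothetical two-state, two-action toy model as above.
states = ["Bull", "Bear"]
P = {("Bull", "Buy"): {"Bull": 0.7, "Bear": 0.3}, ("Bull", "Sell"): {"Bull": 0.4, "Bear": 0.6},
     ("Bear", "Buy"): {"Bull": 0.5, "Bear": 0.5}, ("Bear", "Sell"): {"Bull": 0.2, "Bear": 0.8}}
R = {("Bull", "Buy"): 1.0, ("Bull", "Sell"): 0.5, ("Bear", "Buy"): -0.5, ("Bear", "Sell"): 0.2}
PI = {"Bull": {"Buy": 0.9, "Sell": 0.1}, "Bear": {"Buy": 0.2, "Sell": 0.8}}
gamma = 0.9

# Collapse to MRP quantities under pi, then solve (I - gamma * P_pi) V = R_pi.
n = len(states)
P_pi = np.zeros((n, n))
R_pi = np.zeros(n)
for i, s in enumerate(states):
    for a, pa in PI[s].items():
        R_pi[i] += pa * R[(s, a)]
        for j, s2 in enumerate(states):
            P_pi[i, j] += pa * P[(s, a)].get(s2, 0.0)

V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
for s, v in zip(states, V):
    print(f"V_pi({s}) = {v:.3f}")

# The second identity then gives Q_pi(s, a) directly from V_pi:
for (s, a), r in R.items():
    q = r + gamma * sum(P[(s, a)].get(s2, 0.0) * V[j] for j, s2 in enumerate(states))
    print(f"Q_pi({s}, {a}) = {q:.3f}")
```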

From a philosophical perspective, Q(s,a) signifies a balance of control and acceptance. The agent makes a definitive choice by picking an action, yet it must accept the uncertain consequences of its decision. It's a vivid representation of the universal struggle between free will and destiny.

In the grand narrative of reinforcement learning, the interplay between V(s) and Q(s,a) is the central act. It's a dance of potential and commitment, a fluctuation between the predictable and the unknown. It's a harmonious symphony that resonates with the fundamental dynamics of life itself. We predict, we act, we embrace the consequences, and we learn. After all, isn't that what life - and reinforcement learning - is all about?

Since V^π(s') is just the policy-weighted average of Q^π(s', a') over actions a', substituting it into the identity above yields the Bellman expectation equation written purely in terms of the Action-Value function:

$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in N} P(s, a, s') \sum_{a' \in A} \pi(s', a') \cdot Q^{\pi}(s', a') \quad \forall\, s \in N,\ a \in A$$
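The same recursion can be evaluated entirely in Q-space, without computing V^π first, by iterating it as a fixed-point update until the values stop changing. Again, this is only a sketch on the hypothetical toy model used above, not a production implementation.

```python
# Fixed-point iteration of the Q-space Bellman expectation equation:
#   Q(s,a) <- R(s,a) + gamma * sum_{s'} P(s,a,s') * sum_{a'} pi(a'|s') * Q(s',a')
# on the same hypothetical two-state toy model as in the sketches above.
P = {("Bull", "Buy"): {"Bull": 0.7, "Bear": 0.3}, ("Bull", "Sell"): {"Bull": 0.4, "Bear": 0.6},
     ("Bear", "Buy"): {"Bull": 0.5, "Bear": 0.5}, ("Bear", "Sell"): {"Bull": 0.2, "Bear": 0.8}}
R = {("Bull", "Buy"): 1.0, ("Bull", "Sell"): 0.5, ("Bear", "Buy"): -0.5, ("Bear", "Sell"): 0.2}
PI = {"Bull": {"Buy": 0.9, "Sell": 0.1}, "Bear": {"Buy": 0.2, "Sell": 0.8}}
gamma = 0.9

Q = {sa: 0.0 for sa in R}                     # start from all-zero Q values
for _ in range(200):                          # enough sweeps to converge here
    Q = {(s, a): R[(s, a)] + gamma * sum(
            p * sum(PI[s2][a2] * Q[(s2, a2)] for a2 in PI[s2])
            for s2, p in P[(s, a)].items())
         for (s, a) in R}

for (s, a), q in sorted(Q.items()):
    print(f"Q_pi({s}, {a}) = {q:.3f}")
```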