
module 1 synthesis

Markov Decision Processes and Planning Algorithms

Markov Decision Processes (MDPs) provide an elegant mathematical framework for modeling sequential decision making under uncertainty. An MDP is defined by a set of states S, a set of actions A, transition probabilities P(s'|s,a) for the next state s' given current state s and action a, and a reward function R(s,a).
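As a minimal sketch of this definition in code (the class and field names here are illustrative, assuming a finite, tabular MDP rather than anything from the book's own library):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

State = str
Action = str

@dataclass(frozen=True)
class FiniteMDP:
    """Illustrative container for a finite MDP (S, A, P, R, gamma)."""
    states: Tuple[State, ...]
    actions: Tuple[Action, ...]
    # transition[(s, a)][s'] = P(s'|s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]
    # reward[(s, a)] = expected immediate reward R(s, a)
    reward: Dict[Tuple[State, Action], float]
    gamma: float  # discount factor in [0, 1]
```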

At each timestep $t$, the agent observes state $s_t$, takes action $a_t$, transitions to a new random state $s_{t+1}$, and receives reward $r_{t+1}$. The dynamics satisfy the Markov property: the next state and reward depend only on the current state and action, not on the full history.

MDPs can be continuing (infinite horizon) or terminating after a finite number of steps $T$. The agent's goal is to find a policy $\pi(a|s)$ that maps states to probabilities of selecting each action in order to maximize the expected cumulative discounted reward:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $\gamma \in [0,1]$ is the discount factor.
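A small sketch of how this return can be computed from a finite sequence of sampled rewards (truncating the infinite sum; the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards received after time t, with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0 + 0.81*2 = 2.62
```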

Bellman provided the theoretical foundation for solving MDPs by deriving recursive optimality equations. The optimal value function $V^*(s)$ satisfies:

$$V^*(s) = \max_a \Big\{ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s') \Big\}$$

This elegantly captures the principle of optimality: an optimal policy must continue to act optimally from whatever state it reaches next. The optimal Q-function has a similar recursive form.
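In the notation above, that recursion for the optimal action-value function can be written as:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')$$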

For finite MDPs, algorithms like policy iteration and value iteration leverage the Bellman equation to iteratively improve the policy and value function from any initialization. They provably converge to the optimal policy. Each iteration sweeps through all states to update the value function.
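A minimal value iteration sketch for a small tabular MDP (the dictionary-based representation and the toy two-state example below are made up for illustration, not taken from the book's code):

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Repeat the Bellman backup V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    V = {s: 0.0 for s in states}  # any initialization works
    while True:
        delta = 0.0
        for s in states:
            q_values = [
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            ]
            new_v = max(q_values)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function
    policy = {
        s: max(actions, key=lambda a: R[(s, a)]
               + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in states
    }
    return V, policy

# Toy two-state, two-action MDP (numbers are arbitrary)
states, actions = ["s0", "s1"], ["stay", "go"]
P = {
    ("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0, ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
print(value_iteration(states, actions, P, R, gamma=0.9))
```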

For large or infinite MDPs, representing the value function exactly is intractable. Instead, we use function approximation, where the value function is represented by a parametric model like a neural network. This leads to approximate dynamic programming (ADP) algorithms. Rather than update all states, ADP samples a subset of states, uses the Bellman equation to calculate target values, and updates the value function parameters. ADP enables solving MDPs with large or infinite state spaces. When combined with reinforcement learning, it provides a powerful framework for real-world sequential decision making under uncertainty.
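To make the sample-targets-refit loop concrete, here is a rough sketch of fitted value iteration with a linear function approximator; the feature map, state sampler, and model simulator are placeholders you would supply for a real problem, and a neural network could replace the least-squares fit:

```python
import numpy as np

def fitted_value_iteration(sample_states, actions, features, simulate, reward,
                           gamma, n_iters=50, n_samples=200):
    """Approximate DP: fit V(s) ~ w . phi(s) to Bellman targets on sampled states.

    sample_states(n) -> list of n states drawn from the state space.
    features(s)      -> 1-D numpy feature vector phi(s).
    simulate(s, a)   -> list of (prob, next_state) pairs from a known/learned model.
    reward(s, a)     -> expected immediate reward R(s, a).
    """
    dim = features(sample_states(1)[0]).shape[0]
    w = np.zeros(dim)  # parameters of the linear value function

    for _ in range(n_iters):
        S = sample_states(n_samples)             # sample a subset of states
        Phi = np.stack([features(s) for s in S])
        # Bellman optimality targets under the current approximation
        targets = np.array([
            max(reward(s, a)
                + gamma * sum(p * (features(s2) @ w) for p, s2 in simulate(s, a))
                for a in actions)
            for s in S
        ])
        # Least-squares fit of the parameters to the sampled targets
        w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return w
```

With a neural network the least-squares step becomes a few stochastic gradient updates, but the structure of the loop (sample states, compute Bellman targets, refit the approximator) stays the same.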

Key concepts like policy evaluation, policy improvement, generalized policy iteration, and backward induction carry over from tabular to approximate dynamic programming. Overall, MDPs and dynamic programming provide a principled approach to optimizing sequential decision making under uncertainty.

A special thanks to Claude2 for its help throughout the synthesis.
