Informally, think of a Process as producing a sequence of random outcomes at discrete time steps that we'll index by a time variable t = 0, 1, 2, ....
The authors introduce the Markov property through three models of market prices. Each of the three processes models the evolution of a stock price over time in a different way:
Process 1: This process models the probability of an increase in stock price as a logistic function of the difference between the current price and a reference level. The steepness of the logistic function is controlled by a parameter α1. This process exhibits mean-reverting behavior, meaning the stock price tends to revert to the reference level over time. The state of this process at any time t is simply the current stock price.
The distribution of the state at time t+1 is determined by the difference between the price at time t and the reference level. At first glance this seems to depend only on the state at time t, with the reference level acting as a constant. In typical mean-reversion models, however, the reference level is dynamic, for example the average price over the last 50 time steps. Such a reference level actually encapsulates past price information: the history of prices is condensed into it.
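To make this concrete, here is a minimal sketch of Process 1 in Python. The exact logistic form, the fixed reference level, and names such as `process1_step`, `level`, and `alpha1` are illustrative assumptions; the text above only describes the behavior qualitatively.

```python
import math
import random

def process1_step(price: float, level: float = 100.0, alpha1: float = 0.25) -> float:
    """One step of Process 1: the probability of an up-move is a logistic
    function of (level - price), so prices above the level tend to fall and
    prices below it tend to rise (mean reversion)."""
    prob_up = 1.0 / (1.0 + math.exp(-alpha1 * (level - price)))
    return price + 1 if random.random() < prob_up else price - 1

# The state is just the current price; nothing else is needed to sample the next price.
price = 105.0
for t in range(5):
    price = process1_step(price)
```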
Process 2: This process models the probability of an increase in stock price as a function of the previous price movement. The direction of the next move is biased against the direction of the previous move, and the strength of that bias is controlled by a parameter α2. The state of this process at any time t is a pair consisting of the current stock price and the most recent price movement.
The direction of the most recent price change inherently carries information about the previous price, since it is obtained by subtracting the previous price from the current one. The workaround is to define the state at time t as the pair (current price, most recent move); with this enlarged state, the process satisfies the Markov property.
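A similar sketch of Process 2, under the same caveat: the specific formula for the up-probability below is an assumption, and only the qualitative behavior (a bias against the previous move, with strength α2) comes from the description above. The state passed around is the pair (price, previous move).

```python
import random

def process2_step(state, alpha2: float = 0.7):
    """One step of Process 2. `state` is (price, prev_move) with prev_move in {+1, -1}.
    The next move is biased against prev_move; alpha2 in [0, 1] controls the strength
    of that reverse pull."""
    price, prev_move = state
    prob_up = 0.5 * (1 - alpha2 * prev_move)   # assumed form: pulls against the last move
    move = +1 if random.random() < prob_up else -1
    return (price + move, move)

state = (100.0, +1)   # the price just moved up, so a down-move is now more likely
for t in range(5):
    state = process2_step(state)
```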
Process 3: This process extends Process 2 by making the probability of the next movement dependent on all past movements. Specifically, it depends on the number of past up-moves relative to the number of past down-moves. The extent of the "reverse pull" is controlled by a parameter α3. The state of this process at any time t is a pair consisting of the total number of up-moves and down-moves up to time t.
Viewed in terms of the price alone, this dependence on the entire history of moves violates the Markov property. But by defining the state as the pair of counts (total up-moves, total down-moves), the relevant history is summarized in the state, and with this small trick the process becomes a Markov process again.
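A sketch of Process 3 along the same lines. The state is the pair of counts (up-moves so far, down-moves so far); the particular "reverse pull" formula below is an illustrative assumption, chosen only so that the up-probability falls as the fraction of past up-moves grows, with α3 controlling the strength.

```python
import random

def process3_step(state, alpha3: float = 1.0):
    """One step of Process 3. `state` is (num_up, num_down), the counts of all past
    up- and down-moves. The probability of the next up-move shrinks as the fraction
    of past up-moves grows (the "reverse pull"), scaled by alpha3."""
    num_up, num_down = state
    total = num_up + num_down
    frac_up = num_up / total if total > 0 else 0.5
    prob_up = 0.5 + alpha3 * (0.5 - frac_up)   # assumed form of the reverse pull
    prob_up = min(max(prob_up, 0.0), 1.0)      # clamp to a valid probability
    if random.random() < prob_up:
        return (num_up + 1, num_down)
    return (num_up, num_down + 1)

state = (0, 0)
for t in range(10):
    state = process3_step(state)
# The price itself is recoverable as start_price + num_up - num_down.
```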
In essence, these are not Markov processes in terms of the price alone; they are made to satisfy the Markov property through these tricks of state definition. What is the significance of doing this? Won't deviating from the true nature of price changes lead us away from the truth?
Your question is very insightful. When modeling complex systems such as financial markets, we often need to trade off model complexity against tractability and interpretability. Markov processes are relatively simple models with good mathematical properties: they assume that the future state of the system depends only on the current state, independent of past history. This assumption greatly simplifies the model and lets us bring a range of mathematical tools to bear on analyzing and solving it.
However, as you pointed out, the dynamic behavior of many real-world systems (including financial markets) does not conform to the Markov property, as their future states may depend on past history. In such cases, we can use some tricks to "force" the system to conform to the Markov property. For example, we can define the state of the system as a vector containing past information, or we can introduce some randomness to simulate the impact of history.
The purpose of these tricks is to capture the key dynamic behaviors of the system as much as possible while keeping the model simple and solvable. However, an important limitation of this approach is that it may not fully capture all the complexities of the system. For instance, if a system's dynamic behavior depends on its long-term history, a simple Markov model may not accurately describe this system.
Therefore, when we use these models, we need to understand their limitations and, where possible, use other methods (such as machine learning or complex-systems theory) to compensate for their shortcomings.
If the entire edifice of reinforcement learning is built on the foundation of Markov processes, does it then follow that applying reinforcement learning to financial price trends inherently inherits the limitations of Markov processes?
Reinforcement Learning (RL) is a machine learning methodology predicated on the framework of Markov Decision Processes (MDPs), which presumes that state transitions and rewards in an environment are contingent solely on the current state and action, independent of historical context. This presumption, known as the Markov property, enables RL algorithms to efficiently estimate and optimize value functions and policies. However, not all practical problems adhere to the Markov property; some exhibit non-Markovian rewards or state transitions, such as delayed rewards, partially observable states, or long-term dependencies. These problems are referred to as Non-Markovian Reinforcement Learning (NMRL) problems.
To address NMRL problems, some approaches strive to restore or approximate the Markov property within the existing MDP framework by expanding the state space, employing memory mechanisms, or introducing auxiliary information. Other methods abandon the MDP framework entirely in favor of more generalized stochastic process models, such as Non-Markovian Decision Processes (NMDPs), Partially Observable Markov Decision Processes (POMDPs), or Fractional Order Dynamic Systems (FODS). These methods necessitate the design of new RL algorithms to accommodate the characteristics of non-Markovianity, employing techniques such as automata learning, importance sampling, or gradient temporal-difference learning.
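As an illustration of the first family of approaches (restoring an approximate Markov property by expanding the state), the sketch below wraps a non-Markovian stream of observations so that the "state" handed to an RL agent is a fixed-length window of recent observations. The wrapper class, its name `HistoryWindowState`, and the window length of 50 are illustrative choices, not a method prescribed by the text or any particular library.

```python
from collections import deque

class HistoryWindowState:
    """Augments a stream of raw observations into a fixed-length history window,
    so that an agent treating the window as 'the state' sees an (approximately)
    Markovian process even when single observations are not."""

    def __init__(self, window: int = 50):
        self.window = window
        self.buffer = deque(maxlen=window)

    def reset(self, first_obs):
        self.buffer.clear()
        # Pad with the first observation so the state always has full length.
        self.buffer.extend([first_obs] * self.window)
        return tuple(self.buffer)

    def step(self, obs):
        self.buffer.append(obs)        # the oldest observation drops out automatically
        return tuple(self.buffer)

# Usage: feed each new price to the wrapper and hand the returned tuple
# (the last 50 prices) to the agent as its state.
state_builder = HistoryWindowState(window=50)
state = state_builder.reset(100.0)
state = state_builder.step(101.0)
```

Longer windows capture more history at the cost of a larger state space; memory mechanisms such as recurrent networks play a similar role without a hard cutoff.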
Therefore, applying reinforcement learning to financial price trends does not necessarily inherit the limitations of Markov processes; rather, the appropriate model and algorithm are selected based on the specific characteristics of the problem. Of course, if financial price trends can be approximated well by a Markov process, then traditional RL methods may be simpler and more effective.