Reinforcement Learning
| Domain | State | Action | Reward |
|---|---|---|---|
| Video games | Game display | Controller input | Score |
| Robotics | Sensor readings | Motor commands | Task completion |
| Finance | Market data | Buy/sell decisions | Profit |
Challenges of RL: The Chess Example
Reward: \(+1\) (win) / \(-1\) (loss) / \(0\) (draw or intermediate step)
Sparse rewards
Feedback only arrives at the end of the game.
Temporal credit assignment
A decisive move thirty steps ago caused today’s reward — which action deserves credit?
Stochastic environment
The opponent varies — was a win due to skill or luck?
Exploration vs. exploitation
Try new openings (explore) or stick with what works (exploit)?
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
We learn a policy that maximizes the expected return in a Markov decision process (MDP).
Markov Process
States and transition probabilities \(Pr(s_{t+1}|s_t)\) — the future depends only on the current state.
Markov Reward Process
Adds a reward emitted at each transition.
Markov Decision Process
Adds actions: transitions \(Pr(s_{t+1}|s_t,a_t)\) and rewards depend on the action chosen.
Partially Observable MDP (POMDP)
The agent receives an observation of the state rather than the state itself.
Policy
The rule \(\pi[a_t|s_t]\) by which the agent selects actions.
The RL Loop
At each step, agent and environment interact in a closed loop:
- Agent observes state \(s_t\) and last reward \(r_t\)
- Agent samples action \(a_t \sim \pi[a_t|s_t]\)
- Environment transitions to \(s_{t+1} \sim Pr(s_{t+1}|s_t, a_t)\) and issues \(r_{t+1} \sim Pr(r_{t+1}|s_t, a_t)\)
- Repeat
Goal: choose a policy that maximizes the expected return. We make this precise by assigning a value to each state \(s_t\) and state-action pair \(\{s_t, a_t\}\).
The return \(G_t\) is stochastic — the policy \(\pi[a_t|s_t]\), transitions \(Pr(s_{t+1}|s_t,a_t)\), and rewards \(Pr(r_{t+1}|s_t,a_t)\) are all stochastic, so trajectories differ each run.
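To make the loop concrete, here is a minimal Python sketch that runs it on a randomly generated toy MDP and accumulates the discounted return \(G_0\). The `transition`, `reward`, and `policy` arrays are illustrative stand-ins, not anything defined in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

transition = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # Pr(s'|s,a)
reward = rng.normal(size=(n_states, n_actions))                            # r[s,a]
policy = np.full((n_states, n_actions), 1.0 / n_actions)                   # uniform pi[a|s]

s, G, discount = 0, 0.0, 1.0
for t in range(100):
    a = rng.choice(n_actions, p=policy[s])        # a_t ~ pi[a_t|s_t]
    r = reward[s, a]                              # r_{t+1} from Pr(r|s_t, a_t)
    s = rng.choice(n_states, p=transition[s, a])  # s_{t+1} ~ Pr(s'|s_t, a_t)
    G += discount * r                             # accumulate discounted return
    discount *= gamma
print("sampled return G_0:", G)
```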
State and Action Values
State value (state-value function)
\[v[s_t|\pi] = \mathbb{E}\bigl[G_t\big|s_t,\pi\bigr]\]
Expected return starting from \(s_t\) and following \(\pi\) thereafter. Highest for states where large rewards are likely soon.
Action value (state-action value function)
\[q[s_t, a_t|\pi] = \mathbb{E}\bigl[G_t\big|s_t, a_t, \pi\bigr]\]
Expected return from taking action \(a_t\) in state \(s_t\), then following \(\pi\). Connects future rewards to current actions — resolves the temporal credit assignment problem.
Optimal Policy
Optimal state value
\[v^*[s_t] = \max_{\pi}\,\mathbb{E}\bigl[G_t\big|s_t,\pi\bigr]\]
Optimal action value
\[q^*[s_t,a_t] = \max_{\pi}\,\mathbb{E}\bigl[G_t\big|s_t,a_t,\pi\bigr]\]
If we know \(q^*[s_t,a_t]\), the optimal policy follows immediately — just pick the best action:
\[\pi[a_t|s_t] \leftarrow \argmax_{a_t}\bigl[q^*[s_t,a_t]\bigr]\]
Bellman Equations
State value and action value are mutually consistent:
\(v\) from \(q\) — weighted average of action values under the policy:
\[v[s_t] = \sum_{a_t} \pi[a_t|s_t]\, q[s_t, a_t]\]
\(q\) from \(v\) — immediate reward plus discounted value of next state:
\[q[s_t, a_t] = r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\! Pr(s_{t+1}|s_t,a_t)\,v[s_{t+1}]\]
Substituting each equation into the other yields the Bellman equations — self-consistency relations that link values at time \(t\) to values at time \(t+1\):
\[v[s_t] = \sum_{a_t}\pi[a_t|s_t]\!\left(r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\!Pr(s_{t+1}|s_t,a_t)\,v[s_{t+1}]\right)\]
\[q[s_t,a_t] = r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\!Pr(s_{t+1}|s_t,a_t)\!\left(\sum_{a_{t+1}}\!\pi[a_{t+1}|s_{t+1}]\,q[s_{t+1},a_{t+1}]\right)\]
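The consistency relations can be checked numerically. The sketch below builds a random toy MDP, solves the Bellman equation for \(v\) exactly (for a fixed policy it is linear in \(v\)), and verifies that the two relations above recover each other; all arrays are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # Pr(s'|s,a)
R = rng.normal(size=(S, A))                   # r[s,a]
pi = rng.dirichlet(np.ones(A), size=S)        # pi[a|s]

# Solve the Bellman equation exactly: v = R_pi + gamma * P_pi v
R_pi = (pi * R).sum(axis=1)                   # expected reward under pi
P_pi = np.einsum('sa,sat->st', pi, P)         # state-to-state transitions under pi
v = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

q = R + gamma * P @ v                         # q from v (relation above)
assert np.allclose((pi * q).sum(axis=1), v)   # v from q recovers the same v
```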
Tabular RL algorithms (no function approximation) split into two camps:
Model-based
Use the MDP structure explicitly — transition matrix \(Pr(s_{t+1}|s_t,a_t)\) and reward structure \(r[s,a]\).
- If known \(\to\) solved by dynamic programming
- If unknown \(\to\) estimate from observed trajectories
Model-free
Transition matrix and reward structure are unknown. Two families:
- Value estimation — estimate optimal \(q[s,a]\), then set \(\pi\) to the greedy action
- Policy estimation — directly optimize \(\pi\) by gradient ascent on the expected return
Within each family, algorithms differ in when they update:
Monte Carlo (MC)
Simulate many complete trajectories (episodes), then update.
Temporal Difference (TD)
Update value estimates while the agent traverses the MDP — no need to wait for the episode to end.
Dynamic Programming
Initialize \(v[s]\) to zero and \(\pi[a|s]\) randomly, then alternate:
Policy evaluation — sweep through states, bootstrapping from neighbors:
\[v[s_t] \leftarrow \sum_{a_t}\!\pi[a_t|s_t]\!\left(r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\!Pr(s_{t+1}|s_t,a_t)\,v[s_{t+1}]\right)\]
Policy improvement — greedily choose the action with the highest value:
\[\pi[a_t|s_t] \leftarrow \argmax_{a_t}\!\left[r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\!Pr(s_{t+1}|s_t,a_t)\,v[s_{t+1}]\right]\]
Guaranteed to improve the policy (policy improvement theorem).
Variations:
| Method | Description |
|---|---|
| Policy iteration | Fully evaluate the policy to convergence before each improvement step |
| Value iteration | One evaluation sweep per improvement step |
| Asynchronous DP | Update a subset of states in arbitrary order at each step |
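As a concrete instance, here is a minimal value-iteration sketch on a random toy MDP — one greedy improvement folded into each evaluation sweep. The `P` and `R` arrays are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # Pr(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r[s,a]

v = np.zeros(n_states)
for _ in range(1000):
    # q[s,a] = r[s,a] + gamma * sum_s' Pr(s'|s,a) v[s']
    q = R + gamma * P @ v
    v_new = q.max(axis=1)          # greedy improvement folded into the sweep
    if np.max(np.abs(v_new - v)) < 1e-8:
        break                      # values have converged
    v = v_new

pi = q.argmax(axis=1)              # greedy policy from the converged values
print("v*:", v, "pi*:", pi)
```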
Monte Carlo Methods
No knowledge of the MDP structure — instead, gain experience by repeatedly sampling trajectories and observing rewards. Alternates between:
- Estimating \(q[s,a]\) as the average empirical return each time a state-action pair is visited
- Improving the policy greedily: \(\pi[a|s] \leftarrow \argmax_a\bigl[q[s,a]\bigr]\)
On-policy vs. off-policy:
On-policy
The current policy guides exploration — but unexplored actions can never be estimated.
- Exploring starts: initiate episodes from every state-action pair (impractical for large state spaces)
- \(\epsilon\)-greedy: take a random action with prob \(\epsilon\), the current greedy action with prob \(1-\epsilon\) — trades off exploration and exploitation (see the sketch at the end of this section)
Off-policy
Learn a deterministic target policy \(\pi\) using data collected by a stochastic behavior policy \(\pi'\) (e.g., \(\epsilon\)-greedy).
- Behavior policy explores; target policy stays efficient
- Some methods use importance sampling to correct for the mismatch; others (e.g., Q-learning) estimate values from the greedy action regardless of what was taken
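Putting the on-policy pieces together, here is a sketch of first-visit Monte Carlo control with \(\epsilon\)-greedy exploration. `sample_episode` is a hypothetical hook that runs one full episode under the supplied policy and returns its \((s, a, r)\) transitions:

```python
import numpy as np

def mc_control(sample_episode, n_states, n_actions,
               n_episodes=5000, gamma=0.9, eps=0.1, seed=0):
    q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(seed)

    def policy(s):  # epsilon-greedy: explore with prob eps, else greedy
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(q[s].argmax())

    for _ in range(n_episodes):
        episode = sample_episode(policy)   # [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]
        first_visit = {}
        for idx, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), idx)
        G = 0.0
        for idx in range(len(episode) - 1, -1, -1):  # backward return sweep
            s, a, r = episode[idx]
            G = r + gamma * G
            if first_visit[(s, a)] == idx:           # first-visit update only
                counts[s, a] += 1
                q[s, a] += (G - q[s, a]) / counts[s, a]  # incremental average
    return q
```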
Temporal Difference Methods
TD methods combine bootstrapping (like DP) with sampling (like MC), updating values while the agent traverses the MDP — no waiting for episode end.
SARSA (on-policy) — uses the next action \(a_{t+1}\) sampled from the current policy:
\[q[s_t,a_t] \leftarrow q[s_t,a_t] + \alpha\Bigl(r[s_t,a_t] + \gamma\cdot q[s_{t+1},a_{t+1}] - q[s_t,a_t]\Bigr)\]
The term in parentheses is the TD error — it measures the inconsistency between \(q[s_t,a_t]\) and the one-step estimate \(r[s_t,a_t] + \gamma\cdot q[s_{t+1},a_{t+1}]\).
Q-Learning (off-policy) — uses the greedy action at \(s_{t+1}\), regardless of what the behavior policy \(\pi'\) actually chose:
\[q[s_t,a_t] \leftarrow q[s_t,a_t] + \alpha\Bigl(r[s_t,a_t] + \gamma\cdot\max_a\bigl[q[s_{t+1},a]\bigr] - q[s_t,a_t]\Bigr)\]
In both cases the policy is updated by \(\pi[a|s] \leftarrow \argmax_a[q[s,a]]\). These updates are contraction mappings and converge to the true action values, provided every state-action pair is visited infinitely often.
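A tabular Q-learning sketch, assuming a hypothetical environment whose `reset()` returns a state index and whose `step(a)` returns `(next_state, reward, done)`:

```python
import numpy as np

def q_learning(env, n_states, n_actions,
               n_episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(seed)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy pi'
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(q[s].argmax())
            s_next, r, done = env.step(a)
            # off-policy target: greedy action at s_{t+1}, whatever pi' chose
            td_error = r + gamma * q[s_next].max() * (not done) - q[s, a]
            q[s, a] += alpha * td_error
            s = s_next
    return q
```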
Fitted Q-Learning
Tabular methods update \(q[s_t, a_t]\) separately for every state-action pair — practical only when the state-action space is small. Even chess has \(>10^{40}\) legal states. Fitted Q-learning instead approximates the action-value function with a parameterized model \(q[\mathbf{s}_t, a_t, \phi]\) and learns the parameters \(\phi\) by gradient descent.
Loss — based on the Q-learning consistency condition:
\[L[\phi] = \left(r[\mathbf{s}_t, a_t] + \gamma \cdot \max_a\bigl[q[\mathbf{s}_{t+1}, a, \phi]\bigr] - q[\mathbf{s}_t, a_t, \phi]\right)^2\]
Gradient update:
\[\phi \leftarrow \phi + \alpha\!\left(r[\mathbf{s}_t, a_t] + \gamma \cdot \max_a\bigl[q[\mathbf{s}_{t+1}, a, \phi]\bigr] - q[\mathbf{s}_t, a_t, \phi]\right)\frac{\partial q[\mathbf{s}_t, a_t, \phi]}{\partial \phi}\]
Deep Q-Networks (DQN) for ATARI Games
Deep networks naturally handle high-dimensional states — the network takes state as input and predicts action values for all actions simultaneously.
State representation: \(210\times160\) ATARI frames (128 colors) → resized to \(84\times84\) grayscale → last 4 frames stacked to capture velocity.
Training tricks:
| Trick | Description |
|---|---|
| Reward clipping | Score changes clipped to \(\pm1\) — normalizes rewards across 49 games |
| Experience replay | Tuples \(\langle\mathbf{s}_t,a_t,r_{t+1},\mathbf{s}_{t+1}\rangle\) stored in a buffer; random batches sampled — reduces correlation between adjacent frames |
| Fixed target network | Target parameters \(\phi^-\) held fixed and updated only periodically — network no longer chases a moving target |
With the fixed target network, the update becomes:
\[\phi \leftarrow \phi + \alpha\!\left(r[\mathbf{s}_t, a_t] + \gamma \cdot \max_a\bigl[q[\mathbf{s}_{t+1}, a, \phi^-]\bigr] - q[\mathbf{s}_t, a_t, \phi]\right)\frac{\partial q[\mathbf{s}_t, a_t, \phi]}{\partial \phi}\]
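A sketch of this update in PyTorch. `q_net`, `target_net`, the optimizer, and the replay-buffer batch are all assumed to exist; the networks map a state batch to one value per action:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch      # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # q[s_t, a_t, phi]
    with torch.no_grad():              # target uses frozen parameters phi^-
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically: target_net.load_state_dict(q_net.state_dict())
```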
Double Q-Learning and Double DQNs
Problem: the \(\max\) in the Q-learning update introduces systematic overestimation — stochastic rewards or random network outputs occasionally produce inflated values that get selected by the max, biasing \(q[s_t,a_t]\) upward.
The root cause: the same network both selects the best action and evaluates it.
Fix — train two models \(q_1\) and \(q_2\) simultaneously; one selects the action, the other evaluates it:
\[q_1[s_t,a_t] \leftarrow q_1[s_t,a_t] + \alpha\!\left(r[s_t,a_t] + \gamma\cdot q_2\!\left[s_{t+1},\,\argmax_a\bigl[q_1[s_{t+1},a]\bigr]\right] - q_1[s_t,a_t]\right)\]
\[q_2[s_t,a_t] \leftarrow q_2[s_t,a_t] + \alpha\!\left(r[s_t,a_t] + \gamma\cdot q_1\!\left[s_{t+1},\,\argmax_a\bigl[q_2[s_{t+1},a]\bigr]\right] - q_2[s_t,a_t]\right)\]
New tuples are randomly assigned to update one model or the other. Double DQNs apply this idea with deep networks \(q[\mathbf{s}_t,a_t,\phi_1]\) and \(q[\mathbf{s}_t,a_t,\phi_2]\):
\[\phi_1 \leftarrow \phi_1 + \alpha\!\left(r[\mathbf{s}_t,a_t] + \gamma\cdot q\!\left[\mathbf{s}_{t+1},\argmax_a\bigl[q[\mathbf{s}_{t+1},a,\phi_1]\bigr],\phi_2\right] - q[\mathbf{s}_t,a_t,\phi_1]\right)\frac{\partial q[\mathbf{s}_t,a_t,\phi_1]}{\partial\phi_1}\]
\[\phi_2 \leftarrow \phi_2 + \alpha\!\left(r[\mathbf{s}_t,a_t] + \gamma\cdot q\!\left[\mathbf{s}_{t+1},\argmax_a\bigl[q[\mathbf{s}_{t+1},a,\phi_2]\bigr],\phi_1\right] - q[\mathbf{s}_t,a_t,\phi_2]\right)\frac{\partial q[\mathbf{s}_t,a_t,\phi_2]}{\partial\phi_2}\]
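A sketch of the double-DQN target computation in PyTorch; `q_select` and `q_eval` stand in for the two networks (\(\phi_1\) and \(\phi_2\) above):

```python
import torch

def double_dqn_target(q_select, q_eval, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # one network selects the greedy action ...
        a_star = q_select(s_next).argmax(dim=1)            # argmax_a q[s', a, phi_1]
        # ... the other evaluates it, decoupling selection from evaluation
        q_val = q_eval(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)
        return r + gamma * q_val * (1 - done)              # evaluated with phi_2
```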
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
Policy Gradient Methods
Q-learning estimates action values first, then derives a policy. Policy-based methods skip the middleman and directly learn a stochastic policy \(\pi[a_t|\mathbf{s}_t, \boldsymbol{\theta}]\) — a parametric function mapping state \(\mathbf{s}_t\) to a distribution \(Pr(a_t|\mathbf{s}_t)\) over actions.
In MDPs there is always an optimal deterministic policy — yet there are three strong reasons to use a stochastic one:
Exploration
We are not obliged to take the best known action at every step — the policy naturally explores the space.
Smooth loss
The loss changes smoothly as \(\boldsymbol{\theta}\) changes → gradient descent works even with discrete rewards (analogous to maximum likelihood in classification).
Partial observability
When the state is not fully observed (see POMDP section), two visually identical observations may require different actions. A stochastic policy can take different actions until the ambiguity is resolved.
Derivation of the Policy Gradient
Consider a trajectory \(\boldsymbol{\tau} = [\mathbf{s}_1, a_1, \mathbf{s}_2, a_2, \ldots, \mathbf{s}_T, a_T]\). Its probability under the current policy is:
\[Pr(\boldsymbol{\tau}|\boldsymbol{\theta}) = Pr(\mathbf{s}_1)\prod_{t=1}^{T}\pi[a_t|\mathbf{s}_t,\boldsymbol{\theta}]\,Pr(\mathbf{s}_{t+1}|\mathbf{s}_t,a_t)\]
Objective — maximize expected return over trajectories:
\[\boldsymbol{\theta} = \argmax_{\boldsymbol{\theta}}\Bigl[\mathbb{E}_{\boldsymbol{\tau}}\bigl[r[\boldsymbol{\tau}]\bigr]\Bigr] = \argmax_{\boldsymbol{\theta}}\left[\int Pr(\boldsymbol{\tau}|\boldsymbol{\theta})\,r[\boldsymbol{\tau}]\,d\boldsymbol{\tau}\right]\]
Gradient ascent — multiply and divide by \(Pr(\boldsymbol{\tau}|\boldsymbol{\theta})\) to convert to an expectation, then approximate with \(I\) observed trajectories:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\int\frac{\partial Pr(\boldsymbol{\tau}|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}]\,d\boldsymbol{\tau} \approx \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\frac{1}{Pr(\boldsymbol{\tau}_i|\boldsymbol{\theta})}\frac{\partial Pr(\boldsymbol{\tau}_i|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_i]\]
Likelihood ratio identity \(\left(\frac{\partial \log f[z]}{\partial z} = \frac{1}{f[z]}\frac{\partial f[z]}{\partial z}\right)\) simplifies the update to:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\frac{\partial\log\bigl[Pr(\boldsymbol{\tau}_i|\boldsymbol{\theta})\bigr]}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_i]\]
Log-trajectory decomposition — expanding \(\log[Pr(\boldsymbol{\tau}|\boldsymbol{\theta})]\):
\[\log[Pr(\boldsymbol{\tau}|\boldsymbol{\theta})] = \log[Pr(\mathbf{s}_1)] + \sum_{t=1}^{T}\log[\pi[a_t|\mathbf{s}_t,\boldsymbol{\theta}]] + \sum_{t=1}^{T}\log[Pr(\mathbf{s}_{t+1}|\mathbf{s}_t,a_t)]\]
Only the middle term depends on \(\boldsymbol{\theta}\), giving the REINFORCE update:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[\pi[a_{it}|\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_i]\]
State transition terms \(Pr(\mathbf{s}_{t+1}|\mathbf{s}_t,a_t)\) vanish — the update does not require a Markov model.
Further simplification — since past rewards cannot be affected by the action at time \(t\), only future rewards matter:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[\pi[a_{it}|\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_{it}]\]
where \(r[\boldsymbol{\tau}_{it}]\) is the return of trajectory \(i\) from time \(t\) onward.
REINFORCE Algorithm
REINFORCE is an early policy gradient algorithm that incorporates discounting. It is a Monte Carlo method that generates episodes \(\boldsymbol{\tau}_i = [\mathbf{s}_{i1}, a_{i1}, r_{i2}, \mathbf{s}_{i2}, a_{i2}, r_{i3}, \ldots, r_{iT}]\) under the current policy \(\pi[a|\mathbf{s},\boldsymbol{\theta}]\).
For discrete actions, \(\pi[\mathbf{s}|\boldsymbol{\theta}]\) is a neural network that takes state \(\mathbf{s}\) and outputs one value per action — passed through a softmax to form a distribution, then sampled at each step.
For each episode \(i\) and time step \(t\), compute the discounted return of the partial trajectory \(\boldsymbol{\tau}_{it}\) starting at \(t\):
\[r[\boldsymbol{\tau}_{it}] = \sum_{k=t+1}^{T}\gamma^{k-t-1}\,r_{i,k}\]
Then update parameters for every \((i,t)\):
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{\partial\log[\pi[a_{it}|\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_{it}]\]
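A sketch of the REINFORCE update for a single episode in PyTorch. `policy_net` is assumed to map states to per-action logits; `states`, `actions`, and `rewards` come from an episode sampled under the current policy:

```python
import torch

def reinforce_episode(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    # discounted returns r[tau_t] from each step, computed backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    logits = policy_net(states)                        # one row per time step
    log_pi = torch.log_softmax(logits, dim=1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)

    # gradient *ascent* on E[log pi * return] == descent on its negative
    loss = -(log_pi_a * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```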
Baselines
Policy gradient estimates have high variance — many episodes may be needed for stable gradient estimates. One fix: subtract a baseline \(b\) from the returns:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\bigl(r[\boldsymbol{\tau}_{it}] - b\bigr)\]
As long as \(b\) does not depend on the actions, the expected value is unchanged:
\[\mathbb{E}_{\boldsymbol{\tau}}\!\left[\sum_{t=1}^{T}\frac{\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\cdot b\right] = 0\]
If \(b\) co-varies with irrelevant factors that add noise, subtracting it reduces variance without introducing bias (control variates).
Optimal baseline (minimizes variance):
\[b = \frac{\sum_i\sum_{t=1}^{T}\!\bigl(\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]/\partial\boldsymbol{\theta}\bigr)^2 r[\boldsymbol{\tau}_{it}]}{\sum_i\sum_{t=1}^{T}\!\bigl(\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]/\partial\boldsymbol{\theta}\bigr)^2}\]
In practice, this is approximated by the mean return:
\[b \approx \frac{1}{I}\sum_i r[\boldsymbol{\tau}_i]\]
This removes the variance caused by the overall scale of the returns: when every trajectory earns a high return regardless of the actions taken, subtracting the mean stops all gradients from being dragged in the same direction.
State-Dependent Baselines
A better baseline depends on the current state \(\mathbf{s}_{it}\):
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\bigl(r[\boldsymbol{\tau}_{it}] - b[\mathbf{s}_{it}]\bigr)\]
This compensates for variance introduced by some states having greater overall returns than others, regardless of the actions taken.
Actor-Critic Methods
Actor-critic algorithms are TD policy gradient methods — they can update the policy network at each step, without waiting for an episode to complete (unlike Monte Carlo REINFORCE).
In the TD approach we don’t have access to the future rewards \(r[\boldsymbol{\tau}_t] = \sum_{k=t+1}^{T}r_k\). Instead, approximate with the current reward plus the discounted value of the next state:
\[r[\boldsymbol{\tau}_{it}] \approx r_{i,t+1} + \gamma\cdot v[\mathbf{s}_{i,t+1},\boldsymbol{\phi}]\]
where \(v[\mathbf{s},\boldsymbol{\phi}]\) is a second neural network (the critic) estimating the state value.
Substituting into the state-dependent baseline update (with \(b[\mathbf{s}_{it}] = v[\mathbf{s}_{it},\boldsymbol{\phi}]\)) gives the actor update:
Actor — policy network \(\pi[\mathbf{s}_t,\boldsymbol{\theta}]\):
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[Pr(a_{it}|\mathbf{s}_{it},\boldsymbol{\theta})]}{\partial\boldsymbol{\theta}}\Bigl(r_{i,t+1} + \gamma\cdot v[\mathbf{s}_{i,t+1},\boldsymbol{\phi}] - v[\mathbf{s}_{i,t},\boldsymbol{\phi}]\Bigr)\]
Concurrently, update the critic by bootstrapping:
Critic — value network \(v[\mathbf{s}_t,\boldsymbol{\phi}]\):
\[L[\boldsymbol{\phi}] = \sum_{i=1}^{I}\sum_{t=1}^{T}\Bigl(r_{i,t+1} + \gamma\cdot v[\mathbf{s}_{i,t+1},\boldsymbol{\phi}] - v[\mathbf{s}_{i,t},\boldsymbol{\phi}]\Bigr)^2\]
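A sketch of one actor-critic update step in PyTorch, with assumed `actor` (states → action logits) and `critic` (states → scalar value) networks. Note that the TD error is detached in the actor loss so the policy gradient does not flow into the critic:

```python
import torch

def actor_critic_step(actor, critic, opt_actor, opt_critic,
                      s, a, r, s_next, done, gamma=0.99):
    v_s = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1) * (1 - done)
    td_error = r + gamma * v_next - v_s        # also the advantage estimate

    # critic: minimize the squared TD error (bootstrapped target)
    critic_loss = td_error.pow(2).mean()
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # actor: policy gradient weighted by the (detached) TD error
    log_pi = torch.log_softmax(actor(s), dim=-1)
    log_pi_a = log_pi.gather(-1, a.unsqueeze(-1)).squeeze(-1)
    actor_loss = -(log_pi_a * td_error.detach()).mean()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```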
Offline Reinforcement Learning
In standard RL the agent must interact with the environment to learn. This is impractical when:
Dangerous exploration
e.g., autonomous vehicles — erratic behavior during learning is unacceptable.
Expensive data collection
e.g., financial trading — each exploratory action has real cost.
Decision Transformer
Offline RL suggests a new framing: treat RL as a sequence prediction problem — predict the next action given the history of states, rewards, and actions. The decision transformer uses a transformer decoder for exactly this.
Key modification — returns-to-go (RTG): replace \(r_t\) with \(R_{t:T} = \sum_{t'=t}^{T}r_{t'}\) (remaining future rewards), so the model conditions on how much reward is still to come.
Architecture: states, actions, and RTG are each mapped to fixed-size embeddings (states via CNN for images; actions and RTG via learned embeddings). Trained with masked self-attention and position embeddings.
At inference: specify the desired total return at step 1 and decrement as rewards arrive (e.g., in Atari, set it to the score required to win).
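A sketch of the returns-to-go computation used to build training sequences; the reward list is an illustrative example:

```python
import numpy as np

def returns_to_go(rewards):
    # R_{t:T} = sum_{t'=t}^{T} r_{t'}: each step conditions on the reward
    # still to come, not the reward received so far
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):   # backward cumulative sum
        running += rewards[t]
        rtg[t] = running
    return rtg

print(returns_to_go([0.0, 0.0, 1.0, 0.0, 2.0]))  # -> [3. 3. 3. 2. 2.]
```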