Reinforcement Learning
| Domain | State | Action | Reward |
|---|---|---|---|
| Video games | Game display | Controller input | Score |
| Robotics | Sensor readings | Motor commands | Task completion |
| Finance | Market data | Buy/sell decisions | Profit |
Challenges of RL: The Chess Example
Reward: \(+1\) (win) / \(-1\) (loss) / \(0\) (draw or intermediate step)
Sparse rewards
Feedback only arrives at the end of the game.
Temporal credit assignment
A decisive move thirty steps ago caused today’s reward — which action deserves credit?
Stochastic environment
The opponent varies — was a win due to skill or luck?
Exploration vs. exploitation
Try new openings (explore) or stick with what works (exploit)?
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
We learn a policy that maximizes the expected return in a Markov decision process (MDP).
Markov Process
States and transition probabilities \(Pr(s_{t+1}|s_t)\) — the future depends only on the current state.
Markov Reward Process
Adds a reward emitted at each transition.
Markov Decision Process
Adds actions: transitions \(Pr(s_{t+1}|s_t,a_t)\) and rewards depend on the action chosen.
Partially Observable MDP (POMDP)
The agent receives an observation of the state rather than the state itself.
Policy
The rule \(\pi[a_t|s_t]\) by which the agent selects actions.
The RL Loop
At each step, agent and environment interact in a closed loop:
- Agent observes state \(s_t\) and last reward \(r_t\)
- Agent samples action \(a_t \sim \pi[a_t|s_t]\)
- Environment transitions to \(s_{t+1} \sim Pr(s_{t+1}|s_t, a_t)\) and issues \(r_{t+1} \sim Pr(r_{t+1}|s_t, a_t)\)
- Repeat
Goal: choose a policy that maximizes the expected return. We make this precise by assigning a value to each state \(s_t\) and state-action pair \(\{s_t, a_t\}\).
The return \(G_t\) is stochastic — the policy \(\pi[a_t|s_t]\), transitions \(Pr(s_{t+1}|s_t,a_t)\), and rewards \(Pr(r_{t+1}|s_t,a_t)\) are all stochastic, so trajectories differ each run.
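To make the loop concrete, here is a minimal Python sketch that runs it on a randomly generated toy MDP and accumulates the discounted return \(G_0\). The `transition`, `reward`, and `policy` arrays are illustrative stand-ins, not anything defined in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

transition = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # Pr(s'|s,a)
reward = rng.normal(size=(n_states, n_actions))                            # r[s,a]
policy = np.full((n_states, n_actions), 1.0 / n_actions)                   # uniform pi[a|s]

s, G, discount = 0, 0.0, 1.0
for t in range(100):
    a = rng.choice(n_actions, p=policy[s])        # a_t ~ pi[a_t|s_t]
    r = reward[s, a]                              # r_{t+1} from Pr(r|s_t, a_t)
    s = rng.choice(n_states, p=transition[s, a])  # s_{t+1} ~ Pr(s'|s_t, a_t)
    G += discount * r                             # accumulate discounted return
    discount *= gamma
print("sampled return G_0:", G)
```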
State and Action Values
State value (state-value function)
\[v[s_t|\pi] = \mathbb{E}\bigl[G_t\big|s_t,\pi\bigr]\]
Expected return starting from \(s_t\) and following \(\pi\) thereafter. Highest for states where large rewards are likely soon.
Action value (state-action value function)
\[q[s_t, a_t|\pi] = \mathbb{E}\bigl[G_t\big|s_t, a_t, \pi\bigr]\]
Expected return from taking action \(a_t\) in state \(s_t\), then following \(\pi\). Connects future rewards to current actions — resolves the temporal credit assignment problem.
Optimal Policy
Optimal state value
\[v^*[s_t] = \max_{\pi}\,\mathbb{E}\bigl[G_t\big|s_t,\pi\bigr]\]
Optimal action value
\[q^*[s_t,a_t] = \max_{\pi}\,\mathbb{E}\bigl[G_t\big|s_t,a_t,\pi\bigr]\]
If we know \(q^*[s_t,a_t]\), the optimal policy follows immediately — just pick the best action:
\[\pi[a_t|s_t] \leftarrow \argmax_{a_t}\bigl[q^*[s_t,a_t]\bigr]\]
Bellman Equations
State value and action value are mutually consistent:
\(v\) from \(q\) — weighted average of action values under the policy:
\[v[s_t] = \sum_{a_t} \pi[a_t|s_t]\, q[s_t, a_t]\]
\(q\) from \(v\) — immediate reward plus discounted value of next state:
\[q[s_t, a_t] = r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\! Pr(s_{t+1}|s_t,a_t)\,v[s_{t+1}]\]
Substituting each equation into the other yields the Bellman equations — self-consistency relations that link values at time \(t\) to values at time \(t+1\):
\[v[s_t] = \sum_{a_t}\pi[a_t|s_t]\!\left(r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\!Pr(s_{t+1}|s_t,a_t)\,v[s_{t+1}]\right)\]
\[q[s_t,a_t] = r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\!Pr(s_{t+1}|s_t,a_t)\!\left(\sum_{a_{t+1}}\!\pi[a_{t+1}|s_{t+1}]\,q[s_{t+1},a_{t+1}]\right)\]
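The consistency relations can be checked numerically. The sketch below builds a random toy MDP, solves the Bellman equation for \(v\) exactly (for a fixed policy it is linear in \(v\)), and verifies that the two relations above recover each other; all arrays are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # Pr(s'|s,a)
R = rng.normal(size=(S, A))                   # r[s,a]
pi = rng.dirichlet(np.ones(A), size=S)        # pi[a|s]

# Solve the Bellman equation exactly: v = R_pi + gamma * P_pi v
R_pi = (pi * R).sum(axis=1)                   # expected reward under pi
P_pi = np.einsum('sa,sat->st', pi, P)         # state-to-state transitions under pi
v = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

q = R + gamma * P @ v                         # q from v (relation above)
assert np.allclose((pi * q).sum(axis=1), v)   # v from q recovers the same v
```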
Tabular RL algorithms (no function approximation) split into two camps:
Model-based
Use the MDP structure explicitly — transition matrix \(Pr(s_{t+1}|s_t,a_t)\) and reward structure \(r[s,a]\).
- If known \(\to\) solved by dynamic programming
- If unknown \(\to\) estimate from observed trajectories
Model-free
Transition matrix and reward structure are unknown. Two families:
- Value estimation — estimate optimal \(q[s,a]\), then set \(\pi\) to the greedy action
- Policy estimation — directly optimize \(\pi\) by gradient ascent on the expected return
Within each family, algorithms differ in when they update:
Monte Carlo (MC)
Simulate many complete trajectories (episodes), then update.
Temporal Difference (TD)
Update value estimates while the agent traverses the MDP — no need to wait for the episode to end.
Dynamic Programming
Initialize \(v[s]\) to zero and \(\pi[a|s]\) randomly, then alternate:
Policy evaluation — sweep through states, bootstrapping from neighbors:
\[v[s_t] \leftarrow \sum_{a_t}\!\pi[a_t|s_t]\!\left(r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\!Pr(s_{t+1}|s_t,a_t)\,v[s_{t+1}]\right)\]
Policy improvement — greedily choose the action with the highest value:
\[\pi[a_t|s_t] \leftarrow \argmax_{a_t}\!\left[r[s_t,a_t] + \gamma\!\sum_{s_{t+1}}\!Pr(s_{t+1}|s_t,a_t)\,v[s_{t+1}]\right]\]
Guaranteed to improve the policy (policy improvement theorem).
Variations:
| Method | Description |
|---|---|
| Policy iteration | Fully evaluate the policy to convergence before each improvement step |
| Value iteration | One evaluation sweep per improvement step |
| Asynchronous DP | Update a subset of states in arbitrary order at each step |
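As a concrete instance, here is a minimal value-iteration sketch on a random toy MDP — one greedy improvement folded into each evaluation sweep. The `P` and `R` arrays are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # Pr(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # r[s,a]

v = np.zeros(n_states)
for _ in range(1000):
    # q[s,a] = r[s,a] + gamma * sum_s' Pr(s'|s,a) v[s']
    q = R + gamma * P @ v
    v_new = q.max(axis=1)          # greedy improvement folded into the sweep
    if np.max(np.abs(v_new - v)) < 1e-8:
        break                      # values have converged
    v = v_new

pi = q.argmax(axis=1)              # greedy policy from the converged values
print("v*:", v, "pi*:", pi)
```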
Monte Carlo Methods
No knowledge of the MDP structure — instead, gain experience by repeatedly sampling trajectories and observing rewards. Alternates between:
- Estimating \(q[s,a]\) as the average empirical return each time a state-action pair is visited
- Improving the policy greedily: \(\pi[a|s] \leftarrow \argmax_a\bigl[q[s,a]\bigr]\)
On-policy vs. off-policy:
On-policy
The current policy guides exploration — but unexplored actions can never be estimated.
- Exploring starts: initiate episodes from every state-action pair (impractical for large state spaces)
- \(\epsilon\)-greedy: take a random action with prob \(\epsilon\), the current greedy action with prob \(1-\epsilon\) — trades off exploration and exploitation (see the sketch at the end of this section)
Off-policy
Learn a deterministic target policy \(\pi\) using data collected by a stochastic behavior policy \(\pi'\) (e.g., \(\epsilon\)-greedy).
- Behavior policy explores; target policy stays efficient
- Some methods use importance sampling to correct for the mismatch; others (e.g., Q-learning) estimate values from the greedy action regardless of what was taken
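Putting the on-policy pieces together, here is a sketch of first-visit Monte Carlo control with \(\epsilon\)-greedy exploration. `sample_episode` is a hypothetical hook that runs one full episode under the supplied policy and returns its \((s, a, r)\) transitions:

```python
import numpy as np

def mc_control(sample_episode, n_states, n_actions,
               n_episodes=5000, gamma=0.9, eps=0.1, seed=0):
    q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(seed)

    def policy(s):  # epsilon-greedy: explore with prob eps, else greedy
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(q[s].argmax())

    for _ in range(n_episodes):
        episode = sample_episode(policy)   # [(s_0, a_0, r_1), (s_1, a_1, r_2), ...]
        first_visit = {}
        for idx, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), idx)
        G = 0.0
        for idx in range(len(episode) - 1, -1, -1):  # backward return sweep
            s, a, r = episode[idx]
            G = r + gamma * G
            if first_visit[(s, a)] == idx:           # first-visit update only
                counts[s, a] += 1
                q[s, a] += (G - q[s, a]) / counts[s, a]  # incremental average
    return q
```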
Temporal Difference Methods
TD methods combine bootstrapping (like DP) with sampling (like MC), updating values while the agent traverses the MDP — no waiting for episode end.
SARSA (on-policy) — uses the next action \(a_{t+1}\) sampled from the current policy:
\[q[s_t,a_t] \leftarrow q[s_t,a_t] + \alpha\Bigl(r[s_t,a_t] + \gamma\cdot q[s_{t+1},a_{t+1}] - q[s_t,a_t]\Bigr)\]
The term in parentheses is the TD error — it measures the inconsistency between \(q[s_t,a_t]\) and the one-step estimate \(r[s_t,a_t] + \gamma\cdot q[s_{t+1},a_{t+1}]\).
Q-Learning (off-policy) — uses the greedy action at \(s_{t+1}\), regardless of what the behavior policy \(\pi'\) actually chose:
\[q[s_t,a_t] \leftarrow q[s_t,a_t] + \alpha\Bigl(r[s_t,a_t] + \gamma\cdot\max_a\bigl[q[s_{t+1},a]\bigr] - q[s_t,a_t]\Bigr)\]
In both cases the policy is updated by \(\pi[a|s] \leftarrow \argmax_a[q[s,a]]\). These updates are contraction mappings and converge to the true action values, provided every state-action pair is visited infinitely often.
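A tabular Q-learning sketch, assuming a hypothetical environment whose `reset()` returns a state index and whose `step(a)` returns `(next_state, reward, done)`:

```python
import numpy as np

def q_learning(env, n_states, n_actions,
               n_episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(seed)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy pi'
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(q[s].argmax())
            s_next, r, done = env.step(a)
            # off-policy target: greedy action at s_{t+1}, whatever pi' chose
            td_error = r + gamma * q[s_next].max() * (not done) - q[s, a]
            q[s, a] += alpha * td_error
            s = s_next
    return q
```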
Fitted Q-Learning
Tabular methods update \(q[s_t, a_t]\) separately for every state-action pair — practical only when the state-action space is small. Even chess has \(>10^{40}\) legal states. Fitted Q-learning instead approximates the action-value function with a parameterized model \(q[\mathbf{s}_t, a_t, \phi]\) and learns the parameters \(\phi\) by gradient descent.
Loss — based on the Q-learning consistency condition:
\[L[\phi] = \left(r[\mathbf{s}_t, a_t] + \gamma \cdot \max_a\bigl[q[\mathbf{s}_{t+1}, a, \phi]\bigr] - q[\mathbf{s}_t, a_t, \phi]\right)^2\]
Gradient update:
\[\phi \leftarrow \phi + \alpha\!\left(r[\mathbf{s}_t, a_t] + \gamma \cdot \max_a\bigl[q[\mathbf{s}_{t+1}, a, \phi]\bigr] - q[\mathbf{s}_t, a_t, \phi]\right)\frac{\partial q[\mathbf{s}_t, a_t, \phi]}{\partial \phi}\]
Deep Q-Networks (DQN) for ATARI Games
Deep networks naturally handle high-dimensional states — the network takes state as input and predicts action values for all actions simultaneously.
State representation: \(210\times160\) ATARI frames (128 colors) → resized to \(84\times84\) grayscale → last 4 frames stacked to capture velocity.
Training tricks:
| Trick | Description |
|---|---|
| Reward clipping | Score changes clipped to \(\pm1\) — normalizes rewards across 49 games |
| Experience replay | Tuples \(\langle\mathbf{s}_t,a_t,r_{t+1},\mathbf{s}_{t+1}\rangle\) stored in a buffer; random batches sampled — reduces correlation between adjacent frames |
| Fixed target network | Target parameters \(\phi^-\) held fixed and updated only periodically — network no longer chases a moving target |
With the fixed target network, the update becomes:
\[\phi \leftarrow \phi + \alpha\!\left(r[\mathbf{s}_t, a_t] + \gamma \cdot \max_a\bigl[q[\mathbf{s}_{t+1}, a, \phi^-]\bigr] - q[\mathbf{s}_t, a_t, \phi]\right)\frac{\partial q[\mathbf{s}_t, a_t, \phi]}{\partial \phi}\]
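A sketch of this update in PyTorch. `q_net`, `target_net`, the optimizer, and the replay-buffer batch are all assumed to exist; the networks map a state batch to one value per action:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch      # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # q[s_t, a_t, phi]
    with torch.no_grad():              # target uses frozen parameters phi^-
        target = r + gamma * target_net(s_next).max(dim=1).values * (1 - done)
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically: target_net.load_state_dict(q_net.state_dict())
```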
Double Q-Learning and Double DQNs
Problem: the \(\max\) in the Q-learning update introduces systematic overestimation — stochastic rewards or random network outputs occasionally produce inflated values that get selected by the max, biasing \(q[s_t,a_t]\) upward.
The root cause: the same network both selects the best action and evaluates it.
Fix — train two models \(q_1\) and \(q_2\) simultaneously; one selects the action, the other evaluates it:
\[q_1[s_t,a_t] \leftarrow q_1[s_t,a_t] + \alpha\!\left(r[s_t,a_t] + \gamma\cdot q_2\!\left[s_{t+1},\,\argmax_a\bigl[q_1[s_{t+1},a]\bigr]\right] - q_1[s_t,a_t]\right)\]
\[q_2[s_t,a_t] \leftarrow q_2[s_t,a_t] + \alpha\!\left(r[s_t,a_t] + \gamma\cdot q_1\!\left[s_{t+1},\,\argmax_a\bigl[q_2[s_{t+1},a]\bigr]\right] - q_2[s_t,a_t]\right)\]
New tuples are randomly assigned to update one model or the other. Double DQNs apply this idea with deep networks \(q[\mathbf{s}_t,a_t,\phi_1]\) and \(q[\mathbf{s}_t,a_t,\phi_2]\):
\[\phi_1 \leftarrow \phi_1 + \alpha\!\left(r[\mathbf{s}_t,a_t] + \gamma\cdot q\!\left[\mathbf{s}_{t+1},\argmax_a\bigl[q[\mathbf{s}_{t+1},a,\phi_1]\bigr],\phi_2\right] - q[\mathbf{s}_t,a_t,\phi_1]\right)\frac{\partial q[\mathbf{s}_t,a_t,\phi_1]}{\partial\phi_1}\]
\[\phi_2 \leftarrow \phi_2 + \alpha\!\left(r[\mathbf{s}_t,a_t] + \gamma\cdot q\!\left[\mathbf{s}_{t+1},\argmax_a\bigl[q[\mathbf{s}_{t+1},a,\phi_2]\bigr],\phi_1\right] - q[\mathbf{s}_t,a_t,\phi_2]\right)\frac{\partial q[\mathbf{s}_t,a_t,\phi_2]}{\partial\phi_2}\]
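A sketch of the double-DQN target computation in PyTorch; `q_select` and `q_eval` stand in for the two networks (\(\phi_1\) and \(\phi_2\) above):

```python
import torch

def double_dqn_target(q_select, q_eval, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # one network selects the greedy action ...
        a_star = q_select(s_next).argmax(dim=1)            # argmax_a q[s', a, phi_1]
        # ... the other evaluates it, decoupling selection from evaluation
        q_val = q_eval(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)
        return r + gamma * q_val * (1 - done)              # evaluated with phi_2
```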
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
Policy Gradient Methods
Q-learning estimates action values first, then derives a policy. Policy-based methods skip the middleman and directly learn a stochastic policy \(\pi[a_t|\mathbf{s}_t, \boldsymbol{\theta}]\) — a parametric function mapping state \(\mathbf{s}_t\) to a distribution \(Pr(a_t|\mathbf{s}_t)\) over actions.
In MDPs there is always an optimal deterministic policy — yet there are three strong reasons to use a stochastic one:
Exploration
We are not obliged to take the best known action at every step — the policy naturally explores the space.
Smooth loss
The loss changes smoothly as \(\boldsymbol{\theta}\) changes → gradient descent works even with discrete rewards (analogous to maximum likelihood in classification).
Partial observability
When the state is not fully observed (see POMDP section), two visually identical observations may require different actions. A stochastic policy can take different actions until the ambiguity is resolved.
Derivation of the Policy Gradient
Consider a trajectory \(\boldsymbol{\tau} = [\mathbf{s}_1, a_1, \mathbf{s}_2, a_2, \ldots, \mathbf{s}_T, a_T]\). Its probability under the current policy is:
\[Pr(\boldsymbol{\tau}|\boldsymbol{\theta}) = Pr(\mathbf{s}_1)\prod_{t=1}^{T}\pi[a_t|\mathbf{s}_t,\boldsymbol{\theta}]\,Pr(\mathbf{s}_{t+1}|\mathbf{s}_t,a_t)\]
Objective — maximize expected return over trajectories:
\[\boldsymbol{\theta} = \argmax_{\boldsymbol{\theta}}\Bigl[\mathbb{E}_{\boldsymbol{\tau}}\bigl[r[\boldsymbol{\tau}]\bigr]\Bigr] = \argmax_{\boldsymbol{\theta}}\left[\int Pr(\boldsymbol{\tau}|\boldsymbol{\theta})\,r[\boldsymbol{\tau}]\,d\boldsymbol{\tau}\right]\]
Gradient ascent — multiply and divide by \(Pr(\boldsymbol{\tau}|\boldsymbol{\theta})\) to convert to an expectation, then approximate with \(I\) observed trajectories:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\int\frac{\partial Pr(\boldsymbol{\tau}|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}]\,d\boldsymbol{\tau} \approx \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\frac{1}{Pr(\boldsymbol{\tau}_i|\boldsymbol{\theta})}\frac{\partial Pr(\boldsymbol{\tau}_i|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_i]\]
Likelihood ratio identity \(\left(\frac{\partial \log f[z]}{\partial z} = \frac{1}{f[z]}\frac{\partial f[z]}{\partial z}\right)\) simplifies the update to:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\frac{\partial\log\bigl[Pr(\boldsymbol{\tau}_i|\boldsymbol{\theta})\bigr]}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_i]\]
Log-trajectory decomposition — expanding \(\log[Pr(\boldsymbol{\tau}|\boldsymbol{\theta})]\):
\[\log[Pr(\boldsymbol{\tau}|\boldsymbol{\theta})] = \log[Pr(\mathbf{s}_1)] + \sum_{t=1}^{T}\log[\pi[a_t|\mathbf{s}_t,\boldsymbol{\theta}]] + \sum_{t=1}^{T}\log[Pr(\mathbf{s}_{t+1}|\mathbf{s}_t,a_t)]\]
Only the middle term depends on \(\boldsymbol{\theta}\), giving the REINFORCE update:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[\pi[a_{it}|\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_i]\]
State transition terms \(Pr(\mathbf{s}_{t+1}|\mathbf{s}_t,a_t)\) vanish — the update does not require a Markov model.
Further simplification — since past rewards cannot be affected by the action at time \(t\), only future rewards matter:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[\pi[a_{it}|\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_{it}]\]
where \(r[\boldsymbol{\tau}_{it}]\) is the return of trajectory \(i\) from time \(t\) onward.
REINFORCE Algorithm
REINFORCE is an early policy gradient algorithm that incorporates discounting. It is a Monte Carlo method that generates episodes \(\boldsymbol{\tau}_i = [\mathbf{s}_{i1}, a_{i1}, r_{i2}, \mathbf{s}_{i2}, a_{i2}, r_{i3}, \ldots, r_{iT}]\) under the current policy \(\pi[a|\mathbf{s},\boldsymbol{\theta}]\).
For discrete actions, \(\pi[\mathbf{s}|\boldsymbol{\theta}]\) is a neural network that takes state \(\mathbf{s}\) and outputs one value per action — passed through a softmax to form a distribution, then sampled at each step.
For each episode \(i\) and time step \(t\), compute the discounted return of the partial trajectory \(\boldsymbol{\tau}_{it}\) starting at \(t\):
\[r[\boldsymbol{\tau}_{it}] = \sum_{k=t+1}^{T}\gamma^{k-t-1}\,r_{i,k}\]
Then update parameters for every \((i,t)\):
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{\partial\log[\pi[a_{it}|\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\,r[\boldsymbol{\tau}_{it}]\]
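A sketch of the REINFORCE update for a single episode in PyTorch. `policy_net` is assumed to map states to per-action logits; `states`, `actions`, and `rewards` come from an episode sampled under the current policy:

```python
import torch

def reinforce_episode(policy_net, optimizer, states, actions, rewards, gamma=0.99):
    # discounted returns r[tau_t] from each step, computed backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    logits = policy_net(states)                        # one row per time step
    log_pi = torch.log_softmax(logits, dim=1)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)

    # gradient *ascent* on E[log pi * return] == descent on its negative
    loss = -(log_pi_a * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```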
Baselines
Policy gradient estimates have high variance — many episodes may be needed for stable gradient estimates. One fix: subtract a baseline \(b\) from the returns:
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\bigl(r[\boldsymbol{\tau}_{it}] - b\bigr)\]
As long as \(b\) does not depend on the actions, the expected value is unchanged:
\[\mathbb{E}_{\boldsymbol{\tau}}\!\left[\sum_{t=1}^{T}\frac{\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\cdot b\right] = 0\]
If \(b\) co-varies with irrelevant factors that add noise, subtracting it reduces variance without introducing bias (control variates).
Optimal baseline (minimizes variance):
\[b = \frac{\sum_i\sum_{t=1}^{T}\!\bigl(\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]/\partial\boldsymbol{\theta}\bigr)^2 r[\boldsymbol{\tau}_{it}]}{\sum_i\sum_{t=1}^{T}\!\bigl(\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]/\partial\boldsymbol{\theta}\bigr)^2}\]
In practice, this is approximated by the mean return:
\[b \approx \frac{1}{I}\sum_i r[\boldsymbol{\tau}_i]\]
This removes the variance caused by the overall scale of the returns: when every trajectory earns a high return regardless of the actions taken, subtracting the mean stops all gradients from being dragged in the same direction.
State-Dependent Baselines
A better baseline depends on the current state \(\mathbf{s}_{it}\):
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[\pi_{a_{it}}[\mathbf{s}_{it},\boldsymbol{\theta}]]}{\partial\boldsymbol{\theta}}\bigl(r[\boldsymbol{\tau}_{it}] - b[\mathbf{s}_{it}]\bigr)\]
This compensates for variance introduced by some states having greater overall returns than others, regardless of the actions taken.
Actor-Critic Methods
Actor-critic algorithms are TD policy gradient methods — they can update the policy network at each step, without waiting for an episode to complete (unlike Monte Carlo REINFORCE).
In the TD approach we don’t have access to the future rewards \(r[\boldsymbol{\tau}_t] = \sum_{k=t+1}^{T}r_k\). Instead, approximate with the current reward plus the discounted value of the next state:
\[r[\boldsymbol{\tau}_{it}] \approx r_{i,t+1} + \gamma\cdot v[\mathbf{s}_{i,t+1},\boldsymbol{\phi}]\]
where \(v[\mathbf{s},\boldsymbol{\phi}]\) is a second neural network (the critic) estimating the state value.
Substituting into the state-dependent baseline update (with \(b[\mathbf{s}_{it}] = v[\mathbf{s}_{it},\boldsymbol{\phi}]\)) gives the actor update:
Actor — policy network \(\pi[\mathbf{s}_t,\boldsymbol{\theta}]\):
\[\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\cdot\frac{1}{I}\sum_{i=1}^{I}\sum_{t=1}^{T}\frac{\partial\log[Pr(a_{it}|\mathbf{s}_{it},\boldsymbol{\theta})]}{\partial\boldsymbol{\theta}}\Bigl(r_{i,t+1} + \gamma\cdot v[\mathbf{s}_{i,t+1},\boldsymbol{\phi}] - v[\mathbf{s}_{i,t},\boldsymbol{\phi}]\Bigr)\]
Concurrently, update the critic by bootstrapping:
Critic — value network \(v[\mathbf{s}_t,\boldsymbol{\phi}]\):
\[L[\boldsymbol{\phi}] = \sum_{i=1}^{I}\sum_{t=1}^{T}\Bigl(r_{i,t+1} + \gamma\cdot v[\mathbf{s}_{i,t+1},\boldsymbol{\phi}] - v[\mathbf{s}_{i,t},\boldsymbol{\phi}]\Bigr)^2\]
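A sketch of one actor-critic update step in PyTorch, with assumed `actor` (states → action logits) and `critic` (states → scalar value) networks. Note that the TD error is detached in the actor loss so the policy gradient does not flow into the critic:

```python
import torch

def actor_critic_step(actor, critic, opt_actor, opt_critic,
                      s, a, r, s_next, done, gamma=0.99):
    v_s = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1) * (1 - done)
    td_error = r + gamma * v_next - v_s        # also the advantage estimate

    # critic: minimize the squared TD error (bootstrapped target)
    critic_loss = td_error.pow(2).mean()
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # actor: policy gradient weighted by the (detached) TD error
    log_pi = torch.log_softmax(actor(s), dim=-1)
    log_pi_a = log_pi.gather(-1, a.unsqueeze(-1)).squeeze(-1)
    actor_loss = -(log_pi_a * td_error.detach()).mean()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```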
Offline Reinforcement Learning
In standard RL the agent must interact with the environment to learn. This is impractical when:
Dangerous exploration
e.g., autonomous vehicles — erratic behavior during learning is unacceptable.
Expensive data collection
e.g., financial trading — each exploratory action has real cost.
Decision Transformer
Offline RL suggests a new framing: treat RL as a sequence prediction problem — predict the next action given the history of states, rewards, and actions. The decision transformer uses a transformer decoder for exactly this.
Key modification — returns-to-go (RTG): replace \(r_t\) with \(R_{t:T} = \sum_{t'=t}^{T}r_{t'}\) (remaining future rewards), so the model conditions on how much reward is still to come.
Architecture: states, actions, and RTG are each mapped to fixed-size embeddings (states via CNN for images; actions and RTG via learned embeddings). Trained with masked self-attention and position embeddings.
At inference: specify the desired total return at step 1 and decrement as rewards arrive (e.g., in Atari, set it to the score required to win).
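A sketch of the returns-to-go computation used to build training sequences; the reward list is an illustrative example:

```python
import numpy as np

def returns_to_go(rewards):
    # R_{t:T} = sum_{t'=t}^{T} r_{t'}: each step conditions on the reward
    # still to come, not the reward received so far
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):   # backward cumulative sum
        running += rewards[t]
        rtg[t] = running
    return rtg

print(returns_to_go([0.0, 0.0, 1.0, 0.0, 2.0]))  # -> [3. 3. 3. 2. 2.]
```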