$$ % Define your custom commands here \newcommand{\bmat}[1]{\begin{bmatrix}#1\end{bmatrix}} \newcommand{\E}{\mathbb{E}} \newcommand{\P}{\mathbb{P}} \newcommand{\S}{\mathbb{S}} \newcommand{\R}{\mathbb{R}} \newcommand{\S}{\mathbb{S}} \newcommand{\norm}[2]{\|{#1}\|_{{}_{#2}}} \newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\pdd}[2]{\frac{\partial^2 #1}{\partial #2^2}} \newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|} \newcommand{\abs}[1]{\left|{#1}\right|} \newcommand{\mbf}[1]{\mathbf{#1}} \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\nicefrac}[2]{{}^{#1}\!/_{\!#2}} \newcommand{\argmin}{\operatorname*{arg\,min}} \newcommand{\argmax}{\operatorname*{arg\,max}} \newcommand{\dd}{\operatorname{d}\!} $$

Diffusion and Flow Models

Credits: Introduction to Flow Matching and Diffusion Models by Peter Holderrieth  

Flow Matching by Mario Gemoll

A new generation of AI systems is “creative”: they generate new objects.

Goal of these notes:

  1. Flow and diffusion models from first principles.
  2. The minimal but necessary amount of mathematics for 1.
  3. How to implement and apply these algorithms.

From Generation to Sampling

Images

  • Height \(H\) and Width \(W\)
  • 3 color channels (RGB)

\[\mathbf{z} \in \mathbb{R}^{H \times W \times 3}\]

Videos

  • \(T\) time frames
  • Each frame is an image

\[\mathbf{z} \in \mathbb{R}^{T \times H \times W \times 3}\]

Molecular structures

  • \(N\) atoms
  • Each atom has 3 coordinates

\[\mathbf{z} \in \mathbb{R}^{N \times 3}\]


We represent the objects we want to generate as vectors:

\[\mathbf{z} \in \mathbb{R}^d\]

What does it mean to successfully generate something?

Prompt: “A picture of a dog”

  • Useless (impossible)
  • Bad (rare)
  • Wrong animal (unlikely)
  • Great! (very likely)

Key Insight: How “good” an image is \(\approx\) how “likely” it is under the data distribution.

Generation as Sampling the Data Distribution

We think of the objects we want to generate as following a data distribution \(p_{\text{data}}\). This is a probability density function:

\[p_{\text{data}} : \mathbb{R}^d \to \mathbb{R}_{\ge 0}, \quad \mathbf{z} \mapsto p_{\text{data}}(\mathbf{z})\]

Crucial Note: In practice, we do not know the analytical form of \(p_{\text{data}}\)!

In this framework, generation means sampling from the data distribution:

\(\mathbf{z} \sim p_{\text{data}}\) \(\quad \implies \quad\) \(\mathbf{z} =\) (a sampled object, such as an image)

What Constitutes a Dataset?

Since we don’t know \(p_{\text{data}}\), we rely on a dataset: a finite collection of samples drawn from it.

  • Images: Publicly available images from the internet (e.g., LAION, ImageNet).
  • Videos: Large-scale video repositories (e.g., YouTube).
  • Molecular structures: Scientific repositories (e.g., Protein Data Bank).

Mathematically, a dataset is a collection: \[\{\mathbf{z}_1, \dots, \mathbf{z}_N\} \sim p_{\text{data}}\]

Conditional Generation

Standard (unconditional) generation samples from \(p_{\text{data}}\). However, we often want to condition our generation on a prompt or label \(y\) (e.g., \(y = \text{"Dog"}\)).

Unconditional (\(p_{\text{data}}\))

Fixed prompt “Dog” — diverse samples of the same category:

Conditional (\(p_{\text{data}}(\cdot|y)\))

Changing \(y\) gives different targeted categories:

\(y=\text{"Dog"}\)

\(y=\text{"Cat"}\)

\(y=\text{"Landscape"}\)

Conditional generation means sampling the conditional data distribution: \[ \color{#a71d5d}{\mathbf{z} \sim p_{\text{data}}(\cdot | y)} \]

Tip

We will first focus on unconditional generation and then learn how to translate an unconditional model to a conditional one.

Generative Models as Transformers

A generative model converts samples from a simple initial distribution \(p_{\text{init}}\) into samples from the complex data distribution \(p_{\text{data}}\).

Initial state \(\mathbf{x} \sim p_{\text{init}}\) (e.g. \(\mathcal{N}(0, \mathbf{I}_d)\)) \(\implies\) Generative Model \(\implies\) Target sample \(\mathbf{z} \sim p_{\text{data}}\) (real object)

Example: ODE Trajectory


Existence and Uniqueness Theorem for ODEs

Theorem (Picard–Lindelöf theorem): If the vector field \(u_t(x)\) is continuously differentiable with bounded derivatives, then a unique solution to the ODE

\[ X_0 = x_0, \quad \frac{\mathrm{d}}{\mathrm{d}t}X_t = u_t(X_t) \]

exists. In other words, a flow map exists. More generally, this is true if the vector field is Lipschitz.

Key takeaway: In the cases of practical interest for machine learning, unique solutions to ODE/flows exist.


Example: Linear ODE

Simple vector field:

\[ u_t(x) = -\theta x \quad (\theta > 0) \]

Claim: Flow is given by

\[ \psi_t(x_0) = \exp(-\theta t) x_0 \]

Proof:

  1. Initial condition: \[ \psi_0(x_0) = \exp(0)x_0 = x_0 \]

  2. ODE: \[ \frac{\mathrm{d}}{\mathrm{d}t} \psi_t(x_0) = \frac{\mathrm{d}}{\mathrm{d}t} \left(\exp(-\theta t) x_0\right) = -\theta \exp(-\theta t) x_0 = -\theta \psi_t(x_0) = u_t(\psi_t(x_0)) \]



Euler Method: Coarse vs Fine Steps

  • Large step size: more efficient but higher error
  • Small step size: lower error but less efficient
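The trade-off can be made concrete on the linear ODE from the previous example, where the exact flow \(\exp(-\theta t)x_0\) is known. The following is a minimal sketch, not a reference implementation; the function name `euler_solve` and the values of `theta`, `x0`, and the step counts are illustrative assumptions.

```python
import math

def euler_solve(u, x0, n_steps, t_end=1.0):
    """Integrate dx/dt = u(t, x) from t = 0 to t_end with explicit Euler steps."""
    h = t_end / n_steps
    x = x0
    for i in range(n_steps):
        t = i * h
        x = x + h * u(t, x)  # one Euler step along the vector field
    return x

theta = 2.0  # illustrative decay rate (assumption)
x0 = 1.0     # illustrative starting point (assumption)

def linear_field(t, x):
    # Vector field u_t(x) = -theta * x from the linear ODE example
    return -theta * x

exact = math.exp(-theta * 1.0) * x0
coarse_error = abs(euler_solve(linear_field, x0, n_steps=5) - exact)
fine_error = abs(euler_solve(linear_field, x0, n_steps=500) - exact)
print(f"error with 5 steps:   {coarse_error:.5f}")
print(f"error with 500 steps: {fine_error:.5f}")
```

The fine discretization costs 100 times more vector field evaluations, which is the efficiency side of the trade-off.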


Toy example

Figure credits: Yaron Lipman


Brownian Motion or the Wiener Process

Brownian motion describes the random motion of particles in fluids or gases. It is basically a random walk. Mathematically it can be modeled as a Wiener process. For our purposes, we can think of this as a path starting at the origin at \(t = 0\), and then proceeding with step size \(h\), adding noise from a standard Gaussian scaled by \(\sqrt{h}\) at each step, until \(t = 1\):

\[ W_0 = 0, \quad W_{t+h} = W_t + \sqrt{h}\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I_d) \quad (t = 0, h, 2h, \dots, 1 - h) \]


Figure panels: three sampled trajectories (Sample 1, Sample 2, Sample 3).

Note: A single stochastic process can give rise to many trajectories as the evolution becomes random.
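The discretized update above translates almost line by line into code. This is a small sketch for the scalar case (\(d = 1\)); the helper name `simulate_brownian_path`, the seed, and the path counts are assumptions made for illustration. Since \(W_1\) is a sum of independent Gaussian increments with total variance 1, the empirical variance of the endpoints should be close to 1.

```python
import math
import random

def simulate_brownian_path(n_steps, rng):
    """Return [W_0, W_h, ..., W_1] for step size h = 1 / n_steps (scalar case)."""
    h = 1.0 / n_steps
    path = [0.0]  # the path starts at the origin
    for _ in range(n_steps):
        # add standard Gaussian noise scaled by sqrt(h)
        path.append(path[-1] + math.sqrt(h) * rng.gauss(0.0, 1.0))
    return path

rng = random.Random(0)
endpoints = [simulate_brownian_path(100, rng)[-1] for _ in range(2000)]
mean = sum(endpoints) / len(endpoints)
variance = sum((w - mean) ** 2 for w in endpoints) / len(endpoints)
print(f"empirical mean of W_1: {mean:.3f}, empirical variance: {variance:.3f}")
```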

Stochastic Differential Equations (SDEs)

We can add some Brownian motion to the paths taken by particles moving along a vector field described by an ODE, which gives rise to the concept of a stochastic differential equation (SDE):

\[ \begin{align*} \mathrm{d}X_t &= u_t(X_t)\mathrm{d}t + \sigma_t \mathrm{d}W_t \\ X_0 &= x_0 \end{align*} \]

For the details about the \(\mathrm{d}X\) notation, see Holderrieth & Erives, 2025.

The \(\sigma_t\) in the above equation is called the diffusion coefficient and controls the amount of randomness (the ODE term \(u_t(X_t)\) is also called the drift coefficient).

Such an SDE can be approximated by the Euler-Maruyama method (basically the Euler method with some randomness added to it):

\[ x_{t+h} = x_t + h u_t(x_t) + \sqrt{h} \sigma_t \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I_d) \]


Figure panels: sample trajectories for \(\sigma = 0.15\), \(\sigma = 0.44\), and \(\sigma = 0.87\).
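The Euler-Maruyama update can be sketched as follows for a scalar state. The function name `euler_maruyama`, the drift \(u_t(x) = -x\), and the noise levels are illustrative assumptions; setting the diffusion coefficient to zero recovers the plain Euler method.

```python
import math
import random

def euler_maruyama(u, sigma, x0, n_steps, rng):
    """Simulate dX = u(t, X) dt + sigma(t) dW on [0, 1] and return the whole path."""
    h = 1.0 / n_steps
    path = [x0]
    for i in range(n_steps):
        t = i * h
        x = path[-1]
        noise = rng.gauss(0.0, 1.0)
        # drift step (Euler) plus noise scaled by sqrt(h) * sigma_t
        path.append(x + h * u(t, x) + math.sqrt(h) * sigma(t) * noise)
    return path

def drift(t, x):
    # illustrative drift pulling the state toward zero (assumption)
    return -x

rng = random.Random(0)
deterministic = euler_maruyama(drift, lambda t: 0.0, 1.0, 100, rng)
noisy = euler_maruyama(drift, lambda t: 0.44, 1.0, 100, rng)
print(f"sigma = 0:    X_1 = {deterministic[-1]:.4f}")
print(f"sigma = 0.44: X_1 = {noisy[-1]:.4f}")
```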


Existence and Uniqueness Theorem for SDEs

Theorem: If the vector field \(u_t(x)\) is continuously differentiable with bounded derivatives and the diffusion coefficient \(\sigma_t\) is continuous, then a unique solution (in distribution) to the SDE

\[ X_0 = x_0, \quad \mathrm{d}X_t = u_t(X_t)\mathrm{d}t + \sigma_t\mathrm{d}W_t \]

exists. More generally, this is true if the vector field is Lipschitz.

Key takeaway: In the cases of practical interest for machine learning, unique solutions to SDEs exist.

Stochastic calculus class: Construct solutions via stochastic integrals and Ito-Riemann sums

Ornstein-Uhlenbeck Process

\[ \dd X_t = -\theta X_t \dd t + \sigma \dd W_t \]
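For \(\theta > 0\), the Ornstein-Uhlenbeck process forgets its starting point and settles into a Gaussian stationary distribution \(\mathcal{N}(0, \sigma^2 / (2\theta))\). The sketch below checks this numerically with Euler-Maruyama in one dimension; the parameter values, the time horizon, and the helper name `simulate_ou_endpoint` are assumptions chosen for illustration.

```python
import math
import random

def simulate_ou_endpoint(theta, sigma, x0, t_end, n_steps, rng):
    """Euler-Maruyama simulation of dX = -theta X dt + sigma dW; returns X at t_end."""
    h = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        x = x - theta * x * h + sigma * math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x

theta, sigma = 1.0, 1.0  # illustrative parameters (assumption)
rng = random.Random(0)
samples = [simulate_ou_endpoint(theta, sigma, 3.0, 5.0, 250, rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
variance = sum((x - mean) ** 2 for x in samples) / len(samples)
stationary_variance = sigma ** 2 / (2.0 * theta)
print(f"empirical variance: {variance:.3f}, stationary variance: {stationary_variance:.3f}")
```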


Reminder: Flow and Diffusion Models

Flow Model

Initialize:

\[X_0 \sim \underbrace{p_{\text{init}}}_{\color{#a30000}{\text{e.g. Gaussian}}}\]

ODE:

\[\mathrm{d}X_t = \underbrace{u_t^{\theta}(X_t)}_{\substack{\color{#a30000}{\text{neural network}} \\ \color{#a30000}{\text{vector field}}}}\mathrm{d}t\]

Diffusion Model

Initialize:

\[X_0 \sim \underbrace{p_{\text{init}}}_{\color{#a30000}{\text{e.g. Gaussian}}}\]

SDE:

\[\mathrm{d}X_t = \underbrace{u_t^{\theta}(X_t)}_{\substack{\color{#a30000}{\text{neural network}} \\ \color{#a30000}{\text{vector field}}}}\mathrm{d}t + \underbrace{\sigma_t}_{\color{#a30000}{\text{diffusion coeff.}}}\mathrm{d}W_t\]

To get samples, simulate ODE/SDE from \(t=0\) to \(t=1\) and return \(X_1\)

Next Step: Training a Flow Model

Without training, the model produces nonsense \(\to\) we need to train \(u_t^\theta\).

Training = Finding parameters \(\theta\) such that

\[\underbrace{X_0 \sim p_{\text{init}}}_{\color{#a30000}{\small\textit{Start with initial distribution}}}\]

\[\underbrace{\mathrm{d}X_t = u_t^{\theta}(X_t)\mathrm{d}t}_{\color{#a30000}{\small\textit{Follow along the vector field}}}\]

\(\Rightarrow\)
Implies

\[\underbrace{X_1 \sim p_{\text{data}}}_{\substack{\color{#a30000}{\small\textit{Distribution of final}} \\ \color{#a30000}{\small\textit{point = data dist.}}}}\]

The Flow Matching Matrix

Conditional Probability Path \(\rightarrow\) Conditional Vector Field \(\rightarrow\) Conditional Flow Matching Loss

Marginal Probability Path \(\rightarrow\) Marginal Vector Field \(\rightarrow\) Marginal Flow Matching Loss

“Conditional” = “Per single data point”

“Marginal” = “Across distribution of data points”

Probability Paths: The Path from Noise to Data

Noise (\(t=0\)) \(\longleftarrow\) time \(\longrightarrow\) Data (\(t=1\))


Conditional Probability Path \(p_t(\cdot | z)\)

Figure: samples from a conditional probability path over time, shown at \(t = 0.00\), \(0.25\), \(0.50\), \(0.75\), and \(1.00\), moving from \(p_{\text{init}}\) at \(t=0\) to the data point \(z\) at \(t=1\).

A probability path only specifies the marginals (each snapshot). It says nothing about the evolution of a single particle in time (no dynamics).

Conditional vs. Marginal Probability Path

Figure: two rows of panels at \(t = 0.00\), \(0.25\), \(0.50\), \(0.75\), and \(1.00\). One row shows the conditional probability path \(p_t(\cdot | z)\), running from \(p_{\text{init}}\) to the data point \(z\); the other shows the marginal probability path \(p_t\), running from \(p_{\text{init}}\) to \(p_{\text{data}}\).

Conditional Probability Path

| Object | Notation | Key property | Gaussian example |
|---|---|---|---|
| Conditional Probability Path | \(p_t(\cdot \mid z)\) | Interpolates \(p_{\text{init}}\) and a data point \(z\) | \(\mathcal{N}(\alpha_t z,\, \beta_t^2 I_d)\) |
| Conditional Vector Field | \(u_t^{\text{target}}(x \mid z)\) | | |

Marginal Probability Path

| Object | Notation | Key property | Formula |
|---|---|---|---|
| Marginal Probability Path | \(p_t\) | Interpolates \(p_{\text{init}}\) and \(p_{\text{data}}\) | \(\int p_t(x \mid z)\, p_{\text{data}}(z)\,\mathrm{d}z\) |
| Marginal Vector Field | | | |

Example — Conditional Vector Field for Gaussian

\[u_t^{\text{target}}(x|z) = \left(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \frac{\dot{\beta}_t}{\beta_t}x\]

Proof Sketch:

Step 1: By checking the ODE, show that the flow of the vector field is given by

\[\psi_t^{\text{target}}(x_0|z) = \alpha_t z + \beta_t x_0\]

Step 2: If \(X_0 \sim \mathcal{N}(0, I_d)\) is random, then:

\[X_t = \psi_t(X_0|z) = \alpha_t z + \beta_t X_0 \sim \mathcal{N}(\alpha_t z,\, \beta_t^2 I_d) = p_t(\cdot|z)\]
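Step 1 can be verified term by term. As a worked check, assuming the boundary conditions \(\alpha_0 = 0\) and \(\beta_0 = 1\) so that \(\psi_0^{\text{target}}(x_0|z) = x_0\), differentiate the candidate flow and compare it with the vector field evaluated along the flow:

\[ \begin{align*} \frac{\mathrm{d}}{\mathrm{d}t}\psi_t^{\text{target}}(x_0|z) &= \dot{\alpha}_t z + \dot{\beta}_t x_0 \\ u_t^{\text{target}}\!\left(\psi_t^{\text{target}}(x_0|z)\,\middle|\,z\right) &= \left(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \frac{\dot{\beta}_t}{\beta_t}\left(\alpha_t z + \beta_t x_0\right) = \dot{\alpha}_t z + \dot{\beta}_t x_0 \end{align*} \]

Both lines agree, so \(\psi_t^{\text{target}}(x_0|z) = \alpha_t z + \beta_t x_0\) satisfies the ODE with the correct initial condition.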

Gaussian Conditional Probability Path And Conditional Vector Field

Figure credit: Yaron Lipman

Figure panels: ground truth, ODE samples, and ODE trajectories, shown for the conditional path \(p_t(\cdot|z)\) and the marginal path \(p_t\).

Continuity Equation

Randomly initialized ODE

Given: \(\quad X_0 \sim p_{\text{init}}, \qquad \dfrac{\mathrm{d}}{\mathrm{d}t}X_t = u_t(X_t)\)


Follow probability path:

\[X_t \sim p_t \qquad (0 \le t \le 1)\]

Marginals are \(p_t\)

\(\Longleftrightarrow\) equivalent

Continuity equation holds

\[\frac{\mathrm{d}}{\mathrm{d}t}p_t(x) = -\operatorname{div}(p_t u_t)(x)\]

PDE holds

Continuity Equation

\[\frac{\mathrm{d}}{\mathrm{d}t}p_t(x) = -\operatorname{div}(p_t u_t)(x)\]

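Left-hand side: change of probability mass at \(x\)

Right-hand side: net inflow of probability mass at \(x\) under \(u_t\) (the divergence measures outflow minus inflow, hence the minus sign)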

Algorithm 3 Flow Matching Training Procedure (General)


Require: A dataset of samples \(z \sim p_{\text{data}}\), neural network \(u_t^\theta\)

1: for each mini-batch of data do
2: \(\quad\) Sample a data example \(z\) from the dataset.
3: \(\quad\) Sample a random time \(t \sim \text{Unif}_{[0,1]}\).
4: \(\quad\) Sample \(x \sim p_t(\cdot|z)\)
5: \(\quad\) Compute loss

\[\mathcal{L}(\theta) = \|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2\]

6: \(\quad\) Update the model parameters \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\)
7: end for

Conditional Flow Matching for Gaussian Probability Path

Prob. path

\(\mathcal{N}(\alpha_t z,\, \beta_t^2 I_d)\)

Conditional VF

\(u_t^{\text{target}}(x|z) = \left(\dot{\alpha}_t - \dfrac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \dfrac{\dot{\beta}_t}{\beta_t}x\)

Noise Sampling

\(x \sim p_t(\cdot|z) \quad \Leftrightarrow \quad x = \alpha_t z + \beta_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I_d)\)

Plugging noise sampling into the CFM loss gives:

\[ \begin{align} L_{\text{CFM}}(\theta) &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)} \left[\|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2\right] \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)} \left[\|u_t^\theta(\alpha_t z + \beta_t \epsilon) - u_t^{\text{target}}(\alpha_t z + \beta_t \epsilon|z)\|^2\right] \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)} \left[\|\underbrace{u_t^\theta(\alpha_t z + \beta_t \epsilon)}_{\color{#a30000}{\textbf{noise+data}}} - \underbrace{(\dot{\alpha}_t z + \dot{\beta}_t \epsilon)}_{\color{#a30000}{\textbf{velocity}}}\|^2\right] \end{align} \]
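The last equality uses the fact that the conditional vector field, evaluated at \(x = \alpha_t z + \beta_t \epsilon\), collapses to the velocity \(\dot{\alpha}_t z + \dot{\beta}_t \epsilon\). The sketch below checks this numerically in one dimension. The trigonometric schedule \(\alpha_t = \sin(\pi t / 2)\), \(\beta_t = \cos(\pi t / 2)\) and the test values are assumptions picked only because they satisfy \(\alpha_0 = \beta_1 = 0\) and \(\alpha_1 = \beta_0 = 1\).

```python
import math

# Illustrative schedule (assumption): alpha_t = sin(pi t / 2), beta_t = cos(pi t / 2)
def alpha(t):
    return math.sin(0.5 * math.pi * t)

def beta(t):
    return math.cos(0.5 * math.pi * t)

def alpha_dot(t):
    return 0.5 * math.pi * math.cos(0.5 * math.pi * t)

def beta_dot(t):
    return -0.5 * math.pi * math.sin(0.5 * math.pi * t)

def conditional_vector_field(x, z, t):
    """Gaussian conditional vector field u_t^target(x | z)."""
    ratio = beta_dot(t) / beta(t)
    return (alpha_dot(t) - ratio * alpha(t)) * z + ratio * x

t, z, eps = 0.3, 2.0, -0.7             # arbitrary test values (assumption)
x = alpha(t) * z + beta(t) * eps        # noise sampling: x ~ p_t(. | z)
target = conditional_vector_field(x, z, t)
velocity = alpha_dot(t) * z + beta_dot(t) * eps
print(f"target = {target:.6f}, velocity = {velocity:.6f}")
```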

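Straight Line Schedule

Choose \(\alpha_t = t\) and \(\beta_t = 1 - t\), so that \(\dot{\alpha}_t = 1\) and \(\dot{\beta}_t = -1\):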

\[ \begin{align} L_{\text{CFM}}(\theta) &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)} \left[\|u_t^\theta(\alpha_t z + \beta_t \epsilon) - (\dot{\alpha}_t z + \dot{\beta}_t \epsilon)\|^2\right] \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)} \left[\|\underbrace{u_t^\theta(tz + (1-t)\epsilon)}_{\substack{\color{#a30000}{\textbf{Linear interpolation}} \\ \color{#a30000}{\textbf{of noise and data}}}} - \underbrace{(z - \epsilon)}_{\substack{\color{#a30000}{\textbf{Difference between}} \\ \color{#a30000}{\textbf{noise and data}}}}\|^2\right] \end{align} \]

Figure credit: Yaron Lipman

Algorithm 4 Flow Matching Training for CondOT path


Require: A dataset of samples \(z \sim p_{\text{data}}\), neural network \(u_t^\theta\)

1: for each mini-batch of data do
2: \(\quad\) Sample a data example \(z\) from the dataset.
3: \(\quad\) Sample a random time \(t \sim \text{Unif}_{[0,1]}\).
4: \(\quad\) Sample noise \(\epsilon \sim \mathcal{N}(0, I_d)\)
5: \(\quad\) Set \(x = tz + (1-t)\epsilon\)
6: \(\quad\) Compute loss

\[\mathcal{L}(\theta) = \|u_t^\theta(x) - (z - \epsilon)\|^2\]

7: \(\quad\) Update the model parameters \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\).
8: end for
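The training loop above can be sketched end to end without any deep learning library by swapping the neural network for a tiny linear model \(u_t^\theta(x) = w_x x + w_t t + b\) in one dimension. Everything here is an illustrative assumption (the model, the single-point dataset, the learning rate, and the helper names); a real model would be a neural network trained with automatic differentiation.

```python
import random

def cfm_loss_and_grad(params, z, t, eps):
    """CondOT flow matching loss for the toy model u(x, t) = w_x * x + w_t * t + b."""
    w_x, w_t, b = params
    x = t * z + (1.0 - t) * eps              # step 5: interpolate noise and data
    target = z - eps                         # regression target
    residual = (w_x * x + w_t * t + b) - target
    loss = residual ** 2                     # step 6
    grad = (2.0 * residual * x, 2.0 * residual * t, 2.0 * residual)
    return loss, grad

def train(dataset, n_steps=5000, lr=0.01, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0]
    for _ in range(n_steps):
        z = rng.choice(dataset)              # step 2: sample a data example
        t = rng.random()                     # step 3: t ~ Unif[0, 1]
        eps = rng.gauss(0.0, 1.0)            # step 4: sample noise
        _, grad = cfm_loss_and_grad(params, z, t, eps)
        params = [p - lr * g for p, g in zip(params, grad)]  # step 7: gradient step
    return params

def mean_loss(params, dataset, n_samples=2000, seed=1):
    """Monte Carlo estimate of the CFM loss for fixed parameters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        loss, _ = cfm_loss_and_grad(params, rng.choice(dataset), rng.random(), rng.gauss(0.0, 1.0))
        total += loss
    return total / n_samples

dataset = [2.0]  # toy dataset with a single data point (assumption)
trained = train(dataset)
print(f"loss before training: {mean_loss([0.0, 0.0, 0.0], dataset):.3f}")
print(f"loss after training:  {mean_loss(trained, dataset):.3f}")
```

The loss does not go to zero even for a perfect model, because the target \(z - \epsilon\) is random given \(x\) and \(t\); only its conditional mean can be learned.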

Example Flow Matching — Stable Diffusion 3

The neural network that generates these images was trained with the algorithm just shown

Reminder: Sampling Algorithm for Flow Model

Algorithm 1 Sampling from a Flow Model with Euler method


Require: Neural network vector field \(u_t^\theta\), number of steps \(n\)

1: Set \(t = 0\)
2: Set step size \(h = \tfrac{1}{n}\)
3: Draw a sample \(X_0 \sim p_{\text{init}}\) (random initialization)
4: for \(i = 1, \dots, n\) do
5: \(\quad X_{t+h} = X_t + h u_t^\theta(X_t)\)
6: \(\quad\) Update \(t \leftarrow t + h\)
7: end for
8: return \(X_1\) (final point)
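A minimal sketch of this sampler in one dimension is shown below, taking \(n\) steps of size \(h = 1/n\) so that the final time is exactly \(t = 1\). In place of a trained network it uses a closed-form vector field: for the straight-line path with a data distribution concentrated on a single point \(z\), the field is \((z - x)/(1 - t)\). That toy setting and the names `sample_flow_model` and `point_mass_field` are assumptions for illustration.

```python
import random

def sample_flow_model(vector_field, n_steps, rng):
    """Euler sampling: start from noise and follow the vector field from t = 0 to t = 1."""
    h = 1.0 / n_steps
    x = rng.gauss(0.0, 1.0)           # X_0 ~ p_init = N(0, 1)
    for i in range(n_steps):          # n steps of size h cover [0, 1]
        t = i * h                     # largest t used is 1 - h, so 1 - t never vanishes
        x = x + h * vector_field(x, t)
    return x

z = 2.0  # single data point (assumption)

def point_mass_field(x, t):
    # Stand-in for a trained network: exact field when p_data is a point mass at z
    return (z - x) / (1.0 - t)

rng = random.Random(0)
samples = [sample_flow_model(point_mass_field, 50, rng) for _ in range(5)]
print(samples)
```

Every sample lands on \(z\) up to floating-point error, since the last Euler step from \(t = 1 - h\) moves the state exactly onto the data point.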

The Flow Matching Matrix

Conditional Probability Path \(\rightarrow\) Conditional Vector Field \(\rightarrow\) Conditional Flow Matching Loss

Marginal Probability Path \(\rightarrow\) Marginal Vector Field \(\rightarrow\) Marginal Flow Matching Loss

  • Probability path: defines distributions from noise to data.
  • Vector field: defines the training target that we want to learn.
  • Flow matching loss: the loss function that we want to minimize during training.

Conditional Probability Path, Vector Field, and Flow Matching Loss

| Object | Notation | Key property | Gaussian example |
|---|---|---|---|
| Conditional Probability Path | \(p_t(\cdot \mid z)\) | Interpolates \(p_{\text{init}}\) and a data point \(z\) | \(\mathcal{N}(\alpha_t z,\, \beta_t^2 I_d)\) |
| Conditional Vector Field | \(u_t^{\text{target}}(x \mid z)\) | ODE follows conditional path | \(\left(\dot{\alpha}_t - \dfrac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \dfrac{\dot{\beta}_t}{\beta_t}x\) |
| Conditional FM Loss | \(L_{\text{CFM}}(\theta)\) | Loss we minimize during training | \(\mathbb{E}_{t,z,x}\left[\lVert u_t^\theta(x) - u_t^{\text{target}}(x \mid z)\rVert^2\right]\) |

All these objects are tractable. Just analytical formulas!

Marginal Probability Path, Vector Field, and Flow Matching Loss

| Object | Notation | Key property | Formula |
|---|---|---|---|
| Marginal Probability Path | \(p_t\) | Interpolates \(p_{\text{init}}\) and \(p_{\text{data}}\) | \(\int p_t(x \mid z)\, p_{\text{data}}(z)\,\mathrm{d}z\) |
| Marginal Vector Field | \(u_t^{\text{target}}(x)\) | ODE follows marginal path | \(\int u_t^{\text{target}}(x \mid z)\,\dfrac{p_t(x \mid z)\,p_{\text{data}}(z)}{p_t(x)}\,\mathrm{d}z\) |
| Marginal FM Loss | \(L_{\text{FM}}(\theta)\) | Implicitly minimized via cond FM loss | \(\mathbb{E}_{t,z,x}\left[\lVert u_t^\theta(x) - u_t^{\text{target}}(x)\rVert^2\right]\) |

None of these objects are tractable. But we can still learn them!
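To build intuition for the marginal vector field formula, it helps to look at a case where the integral is tractable after all: replace \(p_{\text{data}}\) by a small finite dataset, so the integral becomes a posterior-weighted average of conditional vector fields. The sketch below does this in one dimension for the straight-line path \(\alpha_t = t\), \(\beta_t = 1 - t\). The empirical-distribution assumption, the schedule, and all function names are illustrative choices.

```python
import math

def conditional_vector_field(x, z, t):
    """u_t^target(x | z) for alpha_t = t, beta_t = 1 - t."""
    return (z - x) / (1.0 - t)

def conditional_log_density(x, z, t):
    """log N(x; t z, (1 - t)^2) in one dimension."""
    std = 1.0 - t
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - t * z) / std) ** 2

def marginal_vector_field(x, t, dataset):
    """Average of conditional fields, weighted by the posterior p_t(x | z) / p_t(x)."""
    log_w = [conditional_log_density(x, z, t) for z in dataset]
    shift = max(log_w)                           # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    return sum(w * conditional_vector_field(x, z, t) for w, z in zip(weights, dataset)) / total

dataset = [-1.0, 1.0]  # toy dataset (assumption)
print(marginal_vector_field(0.0, 0.5, dataset))  # symmetric point: the two pulls cancel
print(marginal_vector_field(0.9, 0.9, dataset))  # close to z = 1: dominated by that data point
```

For real data the dataset is only a finite sample and \(x\) lives in high dimensions, which is why the marginal field is learned by regression instead of being evaluated this way.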

Algorithm 3: Flow Matching Training Procedure (General)

Algorithm 3 Flow Matching Training Procedure (General)


Require: A dataset of samples \(z \sim p_{\text{data}}\), neural network \(u_t^\theta\)

1: for each mini-batch of data do
2: \(\quad\) Sample a data example \(z\) from the dataset.
3: \(\quad\) Sample a random time \(t \sim \text{Unif}_{[0,1]}\).
4: \(\quad\) Sample \(x \sim p_t(\cdot|z)\)
5: \(\quad\) Compute loss

\[\mathcal{L}(\theta) = \|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2\]

6: \(\quad\) Update the model parameters \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\)
7: end for

We can learn the marginal vector field by approximating the cond. VF for many different data points \(z\).

Score Functions = Gradients of the log-likelihood

Log-likelihood: \(\log q(x)\)

Score function: \(\nabla \log q(x)\)

Example — Score of Gaussian Probability Path

\[\nabla \log p_t(x|z) = -\frac{1}{\beta_t^2}x + \frac{\alpha_t}{\beta_t^2}z\]

Proof:

\[p_t(x|z) = \mathcal{N}(x;\, \alpha_t z, \beta_t^2 I_d) = \frac{1}{(2\pi)^{d/2}\beta_t^d} \exp\!\left(-\frac{1}{2\beta_t^2}\|x - \alpha_t z\|^2\right)\]

\[\log p_t(x|z) = \log \mathcal{N}(x;\, \alpha_t z, \beta_t^2 I_d) = -\frac{d}{2}\log(2\pi) - d\log\beta_t - \frac{1}{2\beta_t^2}\|x - \alpha_t z\|^2\]

\[\nabla \log p_t(x|z) = \nabla \log \mathcal{N}(x;\, \alpha_t z, \beta_t^2 I_d) = -\frac{x - \alpha_t z}{\beta_t^2}\]
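The closed-form score can be sanity-checked against a finite-difference derivative of the log-density. This is a one-dimensional sketch; the test values for \(x\), \(z\), \(\alpha_t\), \(\beta_t\) and the function names are arbitrary assumptions.

```python
import math

def log_gaussian_path(x, z, alpha_t, beta_t):
    """log N(x; alpha_t z, beta_t^2) in one dimension."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(beta_t)
            - (x - alpha_t * z) ** 2 / (2.0 * beta_t ** 2))

def conditional_score(x, z, alpha_t, beta_t):
    """Closed-form score of the Gaussian conditional probability path."""
    return -(x - alpha_t * z) / beta_t ** 2

x, z, alpha_t, beta_t = 0.4, 2.0, 0.3, 0.7   # arbitrary test values (assumption)
step = 1e-5
finite_difference = (log_gaussian_path(x + step, z, alpha_t, beta_t)
                     - log_gaussian_path(x - step, z, alpha_t, beta_t)) / (2.0 * step)
print(f"closed form: {conditional_score(x, z, alpha_t, beta_t):.6f}")
print(f"finite diff: {finite_difference:.6f}")
```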

Conditional Probability Path, Vector Field, and Score

| Object | Notation | Key property | Gaussian example |
|---|---|---|---|
| Conditional Probability Path | \(p_t(\cdot \mid z)\) | Interpolates \(p_{\text{init}}\) and a data point \(z\) | \(\mathcal{N}(\alpha_t z,\, \beta_t^2 I_d)\) |
| Conditional Vector Field | \(u_t^{\text{target}}(x \mid z)\) | ODE follows conditional path | \(\left(\dot{\alpha}_t - \dfrac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \dfrac{\dot{\beta}_t}{\beta_t}x\) |
| Conditional Score Function | \(\nabla \log p_t(x \mid z)\) | Gradient of log-likelihood | \(\dfrac{\alpha_t}{\beta_t^2}z - \dfrac{1}{\beta_t^2}x\) |

Marginal Probability Path, Vector Field, and Score

| Object | Notation | Key property | Formula |
|---|---|---|---|
| Marginal Probability Path | \(p_t\) | Interpolates \(p_{\text{init}}\) and \(p_{\text{data}}\) | \(\int p_t(x \mid z)\, p_{\text{data}}(z)\,\mathrm{d}z\) |
| Marginal Vector Field | \(u_t^{\text{target}}(x)\) | ODE follows marginal path | \(\int u_t^{\text{target}}(x \mid z)\,\dfrac{p_t(x \mid z)\,p_{\text{data}}(z)}{p_t(x)}\,\mathrm{d}z\) |
| Marginal Score Function | \(\nabla \log p_t(x)\) | Can be used to convert ODE target to SDE | \(\int \nabla \log p_t(x \mid z)\,\dfrac{p_t(x \mid z)\,p_{\text{data}}(z)}{p_t(x)}\,\mathrm{d}z\) |

Observation: Both Conditional Vector Field and Conditional Score are Linear Functions! Just with Different Coefficients!

| Object | Notation | Key property | Gaussian example |
|---|---|---|---|
| Conditional Vector Field | \(u_t^{\text{target}}(x \mid z)\) | ODE follows conditional path | \(\left(\dot{\alpha}_t - \dfrac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \dfrac{\dot{\beta}_t}{\beta_t}x\) |
| Conditional Score Function | \(\nabla \log p_t(x \mid z)\) | Gradient of log-likelihood | \(\dfrac{\alpha_t}{\beta_t^2}z - \dfrac{1}{\beta_t^2}x\) |

Reparameterization: Velocity Field → Score Function

\[a_t = \left(\beta_t^2 \frac{\dot{\alpha}_t}{\alpha_t} - \dot{\beta}_t \beta_t\right), \qquad b_t = \frac{\dot{\alpha}_t}{\alpha_t}\]

\[ \begin{align} u_t^{\text{target}}(x|z) &= a_t \nabla \log p_t(x|z) + b_t x \\ u_t^{\text{target}}(x) &= a_t \nabla \log p_t(x) + b_t x \end{align} \]
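The conditional identity can be checked by substituting the Gaussian score \(\nabla \log p_t(x|z) = (\alpha_t z - x)/\beta_t^2\) and comparing the coefficients of \(z\) and \(x\) with those of the conditional vector field:

\[ \begin{align*} a_t \nabla \log p_t(x|z) + b_t x &= \frac{a_t \alpha_t}{\beta_t^2} z + \left(b_t - \frac{a_t}{\beta_t^2}\right) x \\ \frac{a_t \alpha_t}{\beta_t^2} &= \left(\frac{\dot{\alpha}_t}{\alpha_t} - \frac{\dot{\beta}_t}{\beta_t}\right)\alpha_t = \dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\alpha_t \\ b_t - \frac{a_t}{\beta_t^2} &= \frac{\dot{\alpha}_t}{\alpha_t} - \frac{\dot{\alpha}_t}{\alpha_t} + \frac{\dot{\beta}_t}{\beta_t} = \frac{\dot{\beta}_t}{\beta_t} \end{align*} \]

Since \(a_t\) and \(b_t\) do not depend on \(z\), averaging both sides over the posterior \(p_t(x|z)\,p_{\text{data}}(z)/p_t(x)\) carries the identity over to the marginal vector field and the marginal score.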

Algorithm 6 Score Matching Training Procedure (General)


Require: A dataset of samples \(z \sim p_{\text{data}}\), score network \(s_t^\theta\)

1: for each mini-batch of data do
2: \(\quad\) Sample a data example \(z\) from the dataset.
3: \(\quad\) Sample a random time \(t \sim \text{Unif}_{[0,1]}\).
4: \(\quad\) Sample \(x \sim p_t(\cdot|z)\)
5: \(\quad\) Compute loss

\[\mathcal{L}(\theta) = \|s_t^\theta(x) - \nabla \log p_t(x|z)\|^2\]

6: \(\quad\) Update the model parameters \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\)
7: end for

Denoising Score Matching for Gaussian Prob. Path

\[\nabla \log p_t(x|z) = -\frac{x - \alpha_t z}{\beta_t^2}\]

\[\epsilon \sim \mathcal{N}(0, I_d) \quad \Rightarrow \quad x = \alpha_t z + \beta_t \epsilon \sim \mathcal{N}(\alpha_t z,\, \beta_t^2 I_d)\]

\[ \begin{align} \mathcal{L}_{\text{dsm}}(\theta) &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)}\!\left[\left\|s_t^\theta(x) + \frac{x - \alpha_t z}{\beta_t^2}\right\|^2\right] \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)}\!\left[\left\|s_t^\theta(\alpha_t z + \beta_t \epsilon) + \frac{\epsilon}{\beta_t}\right\|^2\right] \end{align} \]

Note what the network does: It needs to predict the noise that was used to corrupt the data point! (DENOISING diffusion models)

Algorithm 5 Score Matching Training Procedure for Gaussian probability path


Require: A dataset of samples \(z \sim p_{\text{data}}\), score network \(s_t^\theta\) or noise predictor \(\epsilon_t^\theta\)
Require: Schedulers \(\alpha_t, \beta_t\) with \(\alpha_0 = \beta_1 = 0,\, \alpha_1 = \beta_0 = 1\)

1: for each mini-batch of data do
2: \(\quad\) Sample a data example \(z\) from the dataset.
3: \(\quad\) Sample a random time \(t \sim \text{Unif}_{[0,1]}\).
4: \(\quad\) Sample noise \(\epsilon \sim \mathcal{N}(0, I_d)\)
5: \(\quad\) Set \(x_t = \alpha_t z + \beta_t \epsilon\)
6: \(\quad\) Compute loss

\[\mathcal{L}(\theta) = \left\|s_t^\theta(x_t) + \frac{\epsilon}{\beta_t}\right\|^2\]

Numerically unstable for small \(\beta_t\): the target \(-\epsilon/\beta_t\) blows up as \(\beta_t \to 0\).

7: \(\quad\) Update the model parameters \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\).
8: end for
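One common way around this instability is to train the noise predictor \(\epsilon_t^\theta\) from the Require line instead of the score network. As a sketch, assume the two are related by \(s_t^\theta(x) = -\epsilon_t^\theta(x)/\beta_t\) (a convention adopted here for illustration). The per-sample loss then factorizes as

\[ \left\|s_t^\theta(x_t) + \frac{\epsilon}{\beta_t}\right\|^2 = \frac{1}{\beta_t^2}\left\|\epsilon_t^\theta(x_t) - \epsilon\right\|^2 \]

Dropping the weight \(1/\beta_t^2\) and minimizing \(\|\epsilon_t^\theta(x_t) - \epsilon\|^2\) avoids dividing by \(\beta_t\) during training. This reweights the contribution of each time \(t\) but keeps the same per-time minimizer; the score is recovered afterwards as \(-\epsilon_t^\theta(x)/\beta_t\).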

Fokker-Planck Equation

Randomly initialized SDE

Given: \(\quad X_0 \sim p_{\text{init}}, \qquad \mathrm{d}X_t = u_t(X_t)\mathrm{d}t + \sigma_t\mathrm{d}W_t\)


Follow probability path:

\[X_t \sim p_t \qquad (0 \le t \le 1)\]

Marginals are \(p_t\)

\(\Longleftrightarrow\) equivalent

Fokker-Planck equation holds

\[\frac{\mathrm{d}}{\mathrm{d}t}p_t(x) = \underbrace{-\operatorname{div}(p_t u_t)(x)}_{\color{#a30000}{\small\textit{Continuity equation}}} + \underbrace{\frac{\sigma_t^2}{2}\Delta p_t(x)}_{\color{#a30000}{\small\textit{Heat equation}}}\]

Fokker-Planck Equation

\[\underbrace{\frac{\mathrm{d}}{\mathrm{d}t}p_t(x)}_{\color{#a30000}{\small\textit{Change of prob. mass at } x}} = \underbrace{-\operatorname{div}(p_t u_t)(x)}_{\color{#a30000}{\small\textit{Mass conservation}}} + \underbrace{\frac{\sigma_t^2}{2}\Delta p_t(x)}_{\color{#a30000}{\small\textit{Heat dispersion}}}\]

Stochastic Sampling of Diffusion Models

Choose noise level \(\sigma_t\). By “SDE extension trick”, we can sample from:

\[\mathrm{d}X_t = \left[{\color{#1a6faf}{u_t^{\text{target}}(X_t)}} + {\color{#2a9a2a}{\frac{\sigma_t^2}{2}\nabla \log p_t(X_t)}}\right]\mathrm{d}t + \sigma_t\mathrm{d}W_t\]

For Gaussian probability paths, we can express this solely in terms of the score:

\[\mathrm{d}X_t = \left[\left(a_t + \frac{\sigma_t^2}{2}\right)\nabla \log p_t(X_t) + b_t X_t\right]\mathrm{d}t + \sigma_t\mathrm{d}W_t\]

Plug in the score network:

\[\mathrm{d}X_t = \left[\left(a_t + \frac{\sigma_t^2}{2}\right)s_t^\theta(X_t) + b_t X_t\right]\mathrm{d}t + \sigma_t\mathrm{d}W_t\]
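A stochastic sampler can be sketched with Euler-Maruyama applied to the first form of the SDE above (vector field plus \(\tfrac{\sigma_t^2}{2}\) times the score). To keep the example self-contained, the trained networks are replaced by closed-form expressions for a one-dimensional toy case: the straight-line path with a data distribution concentrated on a single point \(z\), where \(p_t = \mathcal{N}(tz, (1-t)^2)\). The noise schedule \(\sigma_t = c\,(1 - t)\), the toy setting, and the function name are all assumptions for illustration.

```python
import math
import random

def sde_sample(z, n_steps, sigma_scale, rng):
    """Euler-Maruyama for dX = [u_t(X) + 0.5 sigma_t^2 * score_t(X)] dt + sigma_t dW."""
    h = 1.0 / n_steps
    x = rng.gauss(0.0, 1.0)                        # X_0 ~ p_init = N(0, 1)
    for i in range(n_steps):
        t = i * h
        sigma_t = sigma_scale * (1.0 - t)          # assumed schedule, vanishing at t = 1
        vector_field = (z - x) / (1.0 - t)         # stand-in for u_t^theta
        score = -(x - t * z) / (1.0 - t) ** 2      # stand-in for s_t^theta
        drift = vector_field + 0.5 * sigma_t ** 2 * score
        x = x + h * drift + math.sqrt(h) * sigma_t * rng.gauss(0.0, 1.0)
    return x

z = 2.0  # single data point (assumption)
rng = random.Random(0)
samples = [sde_sample(z, 100, 1.0, rng) for _ in range(5)]
print(samples)
```

The samples end close to \(z\) despite the injected noise. The schedule is chosen to vanish at \(t = 1\) because a constant \(\sigma_t\) would make the score term very stiff near the end of the trajectory in this toy case.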

Why Would We Want Stochastic/SDE Dynamics?

In theory: All diffusion coefficients lead to the same result (sample from data distribution).

In practice:

  • Training error: Neural network has not perfectly learnt the marginal vector field/score.
  • Simulation error: We need to simulate SDE/ODE leading to discretization error.

Downstream applications: Fine-tuning, inference-time optimization, etc. might require stochastic evolution

Good news: ODE sampling often leads to the best results. Therefore, SDE sampling is an option, not a must!

Image source: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis [1]

A swamp ogre with a pearl earring by Johannes Vermeer

A car made out of vegetables.

heat death of the universe, line art

Unguided: Generate an image.

Guided: Generate an image of a cat baking a cake.

Vanilla Guided Sampling

Algorithm 7 Guided Sampling Procedure

Require: A trained guided vector field \(u_t^\theta(x|y)\).

  1. Select a prompt \(y \in \mathcal{Y}\), such as “a cat baking a cake”.
  2. Initialize \(X_0 \sim p_\mathrm{init}\).
  3. Simulate \(\mathrm{d}X_t = u_t^\theta(X_t|y)\mathrm{d}t\) from \(t = 0\) to \(t = 1\).

Vanilla Guidance leads to suboptimal results

Prompt: “Corgi dog”

These images do not fit the prompt well, and they contain errors!

Intuition: Classifier Guidance

Classifier-Free Guidance

Classifier-free guidance training: Account for empty token \(\varnothing\)

Algorithm 5 Classifier-free guidance training

Require: Paired dataset \((z, y) \sim p_\mathrm{data}\), neural network \(u_t^\theta\)

  1. for each mini-batch of data do
  2.     Sample a data example \((z, y)\) from the dataset.
  3.     Sample a random time \(t \sim \mathrm{Unif}_{[0,1]}\).
  4.     Sample noise \(\epsilon \sim \mathcal{N}(0, I_d)\)
  5.     Set \(x = \alpha_t z + \beta_t \epsilon\)
  6.     With probability \(p\), drop the label: \(y \leftarrow \varnothing\)
  7.     Compute loss

\[\mathcal{L}(\theta) = \|u_t^\theta(x|y) - u_t^\mathrm{target}(x|z)\|^2\]

  8.     Update the model parameters \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\).
  9. end for
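The only change relative to unguided training is the label-dropping step. A minimal sketch of that step is shown below; representing the empty token \(\varnothing\) by `None`, the drop probability of 0.1, and the helper name `maybe_drop_label` are assumptions for illustration.

```python
import random

EMPTY_LABEL = None  # stand-in for the empty token (assumption)

def maybe_drop_label(y, drop_probability, rng):
    """Replace the label by the empty token with the given probability."""
    return EMPTY_LABEL if rng.random() < drop_probability else y

rng = random.Random(0)
labels = [maybe_drop_label("dog", 0.1, rng) for _ in range(10000)]
dropped_fraction = labels.count(EMPTY_LABEL) / len(labels)
print(f"fraction of dropped labels: {dropped_fraction:.3f}")
```

Because the same network sees both labeled and unlabeled examples, it learns the guided field \(u_t^\theta(x|y)\) and the unguided field \(u_t^\theta(x|\varnothing)\) at once.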

Sampling with Classifier-Free Guidance

Sampling is the same as before, except that we use the weighted vector field:

\[u_t^{\theta,w}(x) = (1-w)u_t^\theta(x|\varnothing) + wu_t^\theta(x|y)\]

Algorithm 8 Classifier-Free Guidance Sampling Procedure

Require: A trained guided vector field \(u_t^\theta(x|y)\).

  1. Select a prompt \(y \in \mathcal{Y}\), or take \(y = \varnothing\) for unguided sampling.
  2. Select a guidance scale \(w > 1\).
  3. Initialize \(X_0 \sim p_\mathrm{init}\).
  4. Simulate \(\mathrm{d}X_t = \left[(1-w)u_t^\theta(X_t|\varnothing) + wu_t^\theta(X_t|y)\right]\mathrm{d}t\) from \(t=0\) to \(t=1\).
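The weighted vector field used in the simulation step is a simple extrapolation from the unguided output toward the guided one. The sketch below applies it to plain Python lists; the function name and the two example vectors are hypothetical stand-ins for the outputs of \(u_t^\theta(x|\varnothing)\) and \(u_t^\theta(x|y)\).

```python
def guided_vector_field(u_unconditional, u_conditional, guidance_scale):
    """Combine the two network outputs as (1 - w) * u(x | empty) + w * u(x | y)."""
    return [(1.0 - guidance_scale) * u0 + guidance_scale * u1
            for u0, u1 in zip(u_unconditional, u_conditional)]

u_empty = [0.0, 1.0]    # hypothetical output for the empty token
u_prompt = [1.0, 3.0]   # hypothetical output for the prompt y

print(guided_vector_field(u_empty, u_prompt, 1.0))  # w = 1 recovers the guided field
print(guided_vector_field(u_empty, u_prompt, 4.0))  # w > 1 extrapolates past it
```

Each simulation step needs two evaluations of the network, one with the prompt and one with the empty token.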

Example: Classifier-Free Guidance

Figure panels: samples with guidance scale \(w = 1.0\) and \(w = 4.0\).