$$
% Define your custom commands here
\newcommand{\bmat}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\P}{\mathbb{P}}
\newcommand{\S}{\mathbb{S}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[2]{\|{#1}\|_{{}_{#2}}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pdd}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|}
\newcommand{\abs}[1]{\left|{#1}\right|}
\newcommand{\mbf}[1]{\mathbf{#1}}
\newcommand{\mc}[1]{\mathcal{#1}}
\newcommand{\bm}[1]{\boldsymbol{#1}}
\newcommand{\nicefrac}[2]{{}^{#1}\!/_{\!#2}}
\newcommand{\argmin}{\operatorname*{arg\,min}}
\newcommand{\argmax}{\operatorname*{arg\,max}}
\newcommand{\dd}{\operatorname{d}\!}
$$

Diffusion and Flow Models

Credits: Introduction to Flow Matching and Diffusion Models by Peter Holderrieth  

Flow Matching by Mario Gemoll


A new generation of AI systems is “creative”: they generate new objects.

Goal of these notes:

  1. Flow and diffusion models from first principles.
  2. The minimal but necessary amount of mathematics for 1.
  3. How to implement and apply these algorithms.

From Generation to Sampling

Images

  • Height \(H\) and Width \(W\)
  • 3 color channels (RGB)

\[\mathbf{z} \in \mathbb{R}^{H \times W \times 3}\]

Videos

  • \(T\) time frames
  • Each frame is an image

\[\mathbf{z} \in \mathbb{R}^{T \times H \times W \times 3}\]

Molecular structures

  • \(N\) atoms
  • Each atom has 3 coordinates

\[\mathbf{z} \in \mathbb{R}^{N \times 3}\]


We represent the objects we want to generate as vectors:

\[\mathbf{z} \in \mathbb{R}^d\]

What does it mean to successfully generate something?

Prompt: “A picture of a dog”

  • Useless image (impossible under the data distribution)
  • Bad image (rare)
  • Wrong animal (unlikely)
  • Great image! (very likely)

Key Insight: How “good” an image is \(\approx\) how “likely” it is under the data distribution.

Generation as Sampling the Data Distribution

We think of the objects we want to generate as following a data distribution \(p_{\text{data}}\). This is a probability density function:

\[p_{\text{data}} : \mathbb{R}^d \to \mathbb{R}_{\ge 0}, \quad \mathbf{z} \mapsto p_{\text{data}}(\mathbf{z})\]

Crucial Note: In practice, we do not know the analytical form of \(p_{\text{data}}\)!

In this framework, generation means sampling from the data distribution:

\(\mathbf{z} \sim p_{\text{data}}\) \(\quad \implies \quad\) \(\mathbf{z}\) is a newly generated object

What Constitutes a Dataset?

Since we don’t know \(p_{\text{data}}\), we rely on a dataset: a finite collection of samples drawn from it.

  • Images: Publicly available images from the internet (e.g., LAION, ImageNet).
  • Videos: Large-scale video repositories (e.g., YouTube).
  • Molecular structures: Scientific repositories (e.g., Protein Data Bank).

Mathematically, a dataset is a collection: \[\{\mathbf{z}_1, \dots, \mathbf{z}_N\} \sim p_{\text{data}}\]

Conditional Generation

Standard (unconditional) generation samples from \(p_{\text{data}}\). However, we often want to condition our generation on a prompt or label \(y\) (e.g., \(y = \text{"Dog"}\)).

Unconditional (\(p_{\text{data}}\)): diverse samples across the whole data distribution.

Conditional (\(p_{\text{data}}(\cdot|y)\)) with a fixed prompt \(y=\text{"Dog"}\): diverse samples of the same category.

Changing \(y\) targets different categories: \(y=\text{"Dog"}\), \(y=\text{"Cat"}\), \(y=\text{"Landscape"}\).

Conditional generation means sampling the conditional data distribution: \[ \color{#a71d5d}{\mathbf{z} \sim p_{\text{data}}(\cdot | y)} \]

Tip

We will first focus on unconditional generation and then learn how to translate an unconditional model to a conditional one.

Generative Models as Distribution Transformers

A generative model converts samples from a simple initial distribution \(p_{\text{init}}\) into samples from the complex data distribution \(p_{\text{data}}\).

Initial State: \(\mathbf{x} \sim p_{\text{init}}\), e.g. \(\mathcal{N}(0, \mathbf{I}_d)\) \(\;\implies\;\) Generative Model \(\;\implies\;\) Target Sample: \(\mathbf{z} \sim p_{\text{data}}\) (a real object)

Example: ODE Trajectory


Existence and Uniqueness Theorem for ODEs

Theorem (Picard–Lindelöf theorem): If the vector field \(u_t(x)\) is continuously differentiable with bounded derivatives, then a unique solution to the ODE

\[ X_0 = x_0, \quad \frac{\mathrm{d}}{\mathrm{d}t}X_t = u_t(X_t) \]

exists. In other words, a flow map exists. More generally, this is true if the vector field is Lipschitz.

Key takeaway: In the cases of practical interest for machine learning, unique solutions to ODE/flows exist.


Example: Linear ODE

Simple vector field:

\[ u_t(x) = -\theta x \quad (\theta > 0) \]

Claim: Flow is given by

\[ \psi_t(x_0) = \exp(-\theta t) x_0 \]

Proof:

  1. Initial condition: \[ \psi_0(x_0) = \exp(-\theta \cdot 0)\, x_0 = x_0 \]

  2. ODE: \[ \frac{\mathrm{d}}{\mathrm{d}t} \psi_t(x_0) = \frac{\mathrm{d}}{\mathrm{d}t} \left(\exp(-\theta t) x_0\right) = -\theta \exp(-\theta t) x_0 = -\theta \psi_t(x_0) = u_t(\psi_t(x_0)) \]



Euler Method: Coarse vs Fine Steps

  • Large step size: more efficient but higher error
  • Small step size: lower error but less efficient
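The tradeoff can be seen on the linear ODE from the previous example, where the exact solution \(\exp(-\theta t)x_0\) is available for comparison. A minimal numpy sketch; `euler_solve` is a hypothetical helper name:

```python
import numpy as np

def euler_solve(u, x0, n_steps):
    """Integrate dx/dt = u(t, x) from t = 0 to t = 1 with the Euler method."""
    x, h = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + h * u(i * h, x)  # x_{t+h} = x_t + h * u_t(x_t)
    return x

# Linear ODE u_t(x) = -theta * x, whose exact solution is exp(-theta * t) * x0.
theta, x0 = 2.0, 1.0
u = lambda t, x: -theta * x
exact = np.exp(-theta) * x0  # value at t = 1

err_coarse = abs(euler_solve(u, x0, 10) - exact)   # few big steps
err_fine = abs(euler_solve(u, x0, 1000) - exact)   # many small steps
```

The coarse run is 100x cheaper but noticeably less accurate, matching the tradeoff above.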


Toy example

Figure credits: Yaron Lipman


Brownian Motion or the Wiener Process

Brownian motion describes the random motion of particles in fluids or gases. It is basically a random walk. Mathematically it can be modeled as a Wiener process. For our purposes, we can think of this as a path starting at the origin at \(t = 0\), and then proceeding with step size \(h\), adding noise from a standard Gaussian scaled by \(\sqrt{h}\) at each step, until \(t = 1\):

\[ W_{t+h} = W_t + \sqrt{h}\epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I_d) \quad (t = 0, h, 2h, \dots, 1 - h) \]



Note: A single stochastic process can give rise to many trajectories as the evolution becomes random.
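The recursion above can be simulated directly. A sketch assuming numpy; the helper name is hypothetical. Three calls give three different trajectories because each run draws fresh noise:

```python
import numpy as np

def sample_brownian_path(h, d=1, rng=None):
    """Simulate W_t on the grid t = 0, h, 2h, ..., 1 from Gaussian increments."""
    rng = rng if rng is not None else np.random.default_rng()
    n = int(round(1.0 / h))
    increments = np.sqrt(h) * rng.standard_normal((n, d))  # sqrt(h) * eps_t
    # W_0 = 0, and each later value is the running sum of the increments.
    return np.concatenate([np.zeros((1, d)), np.cumsum(increments, axis=0)])

rng = np.random.default_rng(0)
paths = [sample_brownian_path(h=0.01, rng=rng) for _ in range(3)]
# Each call draws fresh noise, so the three trajectories all differ.
```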

Stochastic Differential Equations (SDEs)

We can add some Brownian motion to the paths taken by particles moving along a vector field described by an ODE, which gives rise to the concept of a stochastic differential equation (SDE):

\[ \begin{align*} \mathrm{d}X_t &= u_t(X_t)\mathrm{d}t + \sigma_t \mathrm{d}W_t \\ X_0 &= x_0 \end{align*} \]

For the details about the \(\mathrm{d}X\) notation, see Holderrieth & Erives, 2025.

The \(\sigma_t\) in the above equation is called the diffusion coefficient and controls the amount of randomness (the ODE term \(u_t(X_t)\) is also called the drift coefficient).

Such an SDE can be approximated by the Euler-Maruyama method (basically the Euler method with some randomness added to it):

\[ x_{t+h} = x_t + h u_t(x_t) + \sqrt{h} \sigma_t \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I_d) \]


Sample trajectories for increasing noise levels: \(\sigma = 0.15\), \(\sigma = 0.44\), \(\sigma = 0.87\).


Existence and Uniqueness Theorem for SDEs

Theorem: If the vector field \(u_t(x)\) is continuously differentiable with bounded derivatives and the diffusion coeff. is continuous, then a unique solution (in distribution) to the SDE

\[ X_0 = x_0, \quad \mathrm{d}X_t = u_t(X_t)\mathrm{d}t + \sigma_t\mathrm{d}W_t \]

exists. More generally, this is true if the vector field is Lipschitz.

Key takeaway: In the cases of practical interest for machine learning, unique solutions to SDEs exist.

Stochastic calculus class: solutions are constructed via stochastic integrals and Itô–Riemann sums.

Ornstein-Uhlenbeck Process

\[ \dd X_t = -\theta X_t \dd t + \sigma \dd W_t \]
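The Ornstein-Uhlenbeck SDE can be simulated with the Euler-Maruyama update from above. A numpy sketch with illustrative parameter values; the helper name is an assumption:

```python
import numpy as np

def euler_maruyama_ou(theta, sigma, x0, h, rng):
    """Simulate dX_t = -theta X_t dt + sigma dW_t on [0, 1] via Euler-Maruyama."""
    n = int(round(1.0 / h))
    x = np.asarray(x0, dtype=float).copy()
    xs = [x.copy()]
    for _ in range(n):
        eps = rng.standard_normal(x.shape)                   # fresh Gaussian noise
        x = x + h * (-theta * x) + np.sqrt(h) * sigma * eps  # Euler-Maruyama step
        xs.append(x.copy())
    return np.stack(xs)

# 10,000 independent particles, all started at x0 = 1 (illustrative parameters).
rng = np.random.default_rng(0)
traj = euler_maruyama_ou(theta=1.0, sigma=0.5, x0=np.ones(10_000), h=0.01, rng=rng)

# The drift -theta*x pulls the mean toward 0 roughly like exp(-theta * t),
# while the noise keeps individual trajectories spread out.
empirical_mean = traj[-1].mean()
```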


Reminder: Flow and Diffusion Models

Flow Model

Initialize:

\[X_0 \sim \underbrace{p_{\text{init}}}_{\color{#a30000}{\text{e.g. Gaussian}}}\]

ODE:

\[\mathrm{d}X_t = \underbrace{u_t^{\theta}(X_t)}_{\substack{\color{#a30000}{\text{neural network}} \\ \color{#a30000}{\text{vector field}}}}\mathrm{d}t\]

Diffusion Model

Initialize:

\[X_0 \sim \underbrace{p_{\text{init}}}_{\color{#a30000}{\text{e.g. Gaussian}}}\]

SDE:

\[\mathrm{d}X_t = \underbrace{u_t^{\theta}(X_t)}_{\substack{\color{#a30000}{\text{neural network}} \\ \color{#a30000}{\text{vector field}}}}\mathrm{d}t + \underbrace{\sigma_t}_{\color{#a30000}{\text{diffusion coeff.}}}\mathrm{d}W_t\]

To get samples, simulate ODE/SDE from \(t=0\) to \(t=1\) and return \(X_1\)

Next Step: Training a Flow Model

Without training, the model produces nonsense \(\to\) we need to train \(u_t^\theta\)

Training = Finding parameters \(\theta\) such that

\[\underbrace{X_0 \sim p_{\text{init}}}_{\color{#a30000}{\small\textit{Start with initial distribution}}}\]

\[\underbrace{\mathrm{d}X_t = u_t^{\theta}(X_t)\mathrm{d}t}_{\color{#a30000}{\small\textit{Follow along the vector field}}}\]

\(\Rightarrow\)
Implies

\[\underbrace{X_1 \sim p_{\text{data}}}_{\substack{\color{#a30000}{\small\textit{Distribution of final}} \\ \color{#a30000}{\small\textit{point = data dist.}}}}\]

The Flow Matching Matrix

Conditional
Probability Path

\(\rightarrow\)

Conditional
Vector Field

\(\rightarrow\)

Conditional
Flow Matching Loss

Marginal
Probability Path

\(\rightarrow\)

Marginal
Vector Field

\(\rightarrow\)

Marginal
Flow Matching Loss

“Conditional” = “Per single data point”

“Marginal” = “Across distribution of data points”

Probability Paths: The Path from Noise to Data

Noise (\(t=0\)) \(\longleftarrow\) time \(\longrightarrow\) Data (\(t=1\))


Conditional Probability Path \(p_t(\cdot | z)\)

Samples from a conditional probability path over time: snapshots at \(t = 0.00, 0.25, 0.50, 0.75, 1.00\), starting from \(p_{\text{init}}\) at \(t=0\) and concentrating on the data point \(z\) at \(t=1\).

A probability path only specifies the marginals (each snapshot). It says nothing about the evolution of a single particle in time (no dynamics).

Conditional vs. Marginal Probability Path

Conditional Probability Path \(p_t(\cdot | z)\): interpolates from \(p_{\text{init}}\) at \(t=0\) to a single data point \(z\) at \(t=1\).

Marginal Probability Path \(p_t\): interpolates from \(p_{\text{init}}\) at \(t=0\) to \(p_{\text{data}}\) at \(t=1\).

Conditional Probability Path

|  | Notation | Key property | Gaussian example |
| --- | --- | --- | --- |
| Conditional Probability Path | \(p_t(\cdot\|z)\) | Interpolates \(p_{\text{init}}\) and a data point \(z\) | \(\mathcal{N}(\alpha_t z,\, \beta_t^2 I_d)\) |
| Conditional Vector Field | \(u_t^{\text{target}}(x\|z)\) |  |  |

Marginal Probability Path

|  | Notation | Key property | Formula |
| --- | --- | --- | --- |
| Marginal Probability Path | \(p_t\) | Interpolates \(p_{\text{init}}\) and \(p_{\text{data}}\) | \(\int p_t(x\|z)\, p_{\text{data}}(z)\,\mathrm{d}z\) |
| Marginal Vector Field | \(u_t^{\text{target}}(x)\) |  |  |

Example — Conditional Vector Field for Gaussian

\[u_t^{\text{target}}(x|z) = \left(\dot{\alpha}_t - \frac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \frac{\dot{\beta}_t}{\beta_t}x\]

Proof Sketch:

Step 1: By checking ODE, show that the flow of the vector field is given by

\[\psi_t^{\text{target}}(x_0|z) = \alpha_t z + \beta_t x_0\]

Step 2: If \(X_0 \sim \mathcal{N}(0, I_d)\) is random, then:

\[X_t = \psi_t^{\text{target}}(X_0|z) = \alpha_t z + \beta_t X_0 \sim \mathcal{N}(\alpha_t z,\, \beta_t^2 I_d) = p_t(\cdot|z)\]
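Step 2 can be checked empirically: pushing standard-normal samples through \(\psi_t^{\text{target}}(x_0|z) = \alpha_t z + \beta_t x_0\) yields the claimed Gaussian. A numpy sketch; the schedule values and data point are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_t, beta_t = 0.7, 0.3        # illustrative schedule values at a fixed t
z = np.array([2.0, -1.0])         # a fixed data point

# Push samples X_0 ~ N(0, I_d) through psi_t(x0|z) = alpha_t z + beta_t x0 ...
x0 = rng.standard_normal((100_000, 2))
xt = alpha_t * z + beta_t * x0

# ... and check empirically that X_t ~ N(alpha_t z, beta_t^2 I_d).
mean_ok = np.allclose(xt.mean(axis=0), alpha_t * z, atol=0.02)
std_ok = np.allclose(xt.std(axis=0), beta_t, atol=0.02)
```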

Gaussian Conditional Probability Path And Conditional Vector Field

Figure credit: Yaron Lipman

Panels (left to right): ground truth, ODE samples, and ODE trajectories, shown for both the conditional path \(p_t(\cdot|z)\) and the marginal path \(p_t\).

Continuity Equation

Given a randomly initialized ODE

\[X_0 \sim p_{\text{init}}, \qquad \frac{\mathrm{d}}{\mathrm{d}t}X_t = u_t(X_t),\]

the following are equivalent:

The ODE follows the probability path, i.e. the marginals are \(p_t\):

\[X_t \sim p_t \qquad (0 \le t \le 1)\]

\(\Longleftrightarrow\)

The continuity equation (a PDE) holds:

\[\frac{\mathrm{d}}{\mathrm{d}t}p_t(x) = -\operatorname{div}(p_t u_t)(x)\]

Continuity Equation

\[\frac{\mathrm{d}}{\mathrm{d}t}p_t(x) = -\operatorname{div}(p_t u_t)(x)\]

The left-hand side is the change of probability mass at \(x\); the right-hand side, \(-\operatorname{div}(p_t u_t)(x)\), is the net inflow of probability mass at \(x\) under the flow of \(u_t\) (inflow minus outflow).
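The continuity equation can be verified symbolically for the Gaussian conditional path in one dimension. A sketch assuming `sympy` is available, using the concrete (assumed) schedule \(\alpha_t = t\), \(\beta_t = 1 - t\) and the conditional vector field formula from earlier:

```python
import sympy as sp

x, t, z = sp.symbols("x t z", real=True)

# Straight-line schedule: alpha_t = t, beta_t = 1 - t (valid for 0 <= t < 1).
alpha, beta = t, 1 - t
alpha_d, beta_d = sp.diff(alpha, t), sp.diff(beta, t)

# 1-D Gaussian conditional path p_t(x|z) = N(alpha_t z, beta_t^2) ...
p = sp.exp(-((x - alpha * z) ** 2) / (2 * beta**2)) / (sp.sqrt(2 * sp.pi) * beta)
# ... and its conditional vector field.
u = (alpha_d - beta_d / beta * alpha) * z + beta_d / beta * x

# Continuity equation: d/dt p_t(x) + div(p_t u_t)(x) = 0 (div = d/dx in 1-D).
residual = sp.simplify(sp.diff(p, t) + sp.diff(p * u, x))
```

The residual simplifies to zero, so the pair \((p_t(\cdot|z), u_t^{\text{target}}(\cdot|z))\) satisfies the PDE.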

Algorithm 3 Flow Matching Training Procedure (General)


Require: A dataset of samples \(z \sim p_{\text{data}}\), neural network \(u_t^\theta\)

1: for each mini-batch of data do
2: \(\quad\) Sample a data example \(z\) from the dataset.
3: \(\quad\) Sample a random time \(t \sim \text{Unif}_{[0,1]}\).
4: \(\quad\) Sample \(x \sim p_t(\cdot|z)\)
5: \(\quad\) Compute loss

\[\mathcal{L}(\theta) = \|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2\]

6: \(\quad\) Update the model parameters \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\)
7: end for

Conditional Flow Matching for Gaussian Probability Path

Prob. path

\(\mathcal{N}(\alpha_t z,\, \beta_t^2 I_d)\)

Conditional VF

\(u_t^{\text{target}}(x|z) = \left(\dot{\alpha}_t - \dfrac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \dfrac{\dot{\beta}_t}{\beta_t}x\)

Noise Sampling

\(x \sim p_t(\cdot|z) \quad \Leftrightarrow \quad x = \alpha_t z + \beta_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I_d)\)

Plugging in Noise Sampling into CFM Loss results in:

\[ \begin{align} L_{\text{CFM}}(\theta) &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, x \sim p_t(\cdot|z)} \left[\|u_t^\theta(x) - u_t^{\text{target}}(x|z)\|^2\right] \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)} \left[\|u_t^\theta(\alpha_t z + \beta_t \epsilon) - u_t^{\text{target}}(\alpha_t z + \beta_t \epsilon|z)\|^2\right] \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)} \left[\|\underbrace{u_t^\theta(\alpha_t z + \beta_t \epsilon)}_{\color{#a30000}{\textbf{noise+data}}} - \underbrace{(\dot{\alpha}_t z + \dot{\beta}_t \epsilon)}_{\color{#a30000}{\textbf{velocity}}}\|^2\right] \end{align} \]
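The last equality, \(u_t^{\text{target}}(\alpha_t z + \beta_t \epsilon \,|\, z) = \dot{\alpha}_t z + \dot{\beta}_t \epsilon\), can be sanity-checked numerically. A numpy sketch; the schedule \(\alpha_t = t^2\), \(\beta_t = 1 - t^2\) is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# An example schedule (assumed for illustration): alpha_t = t^2, beta_t = 1 - t^2.
t = 0.3
alpha, beta = t**2, 1 - t**2
alpha_d, beta_d = 2 * t, -2 * t   # time derivatives

z = rng.standard_normal(5)    # data point
eps = rng.standard_normal(5)  # noise
x = alpha * z + beta * eps    # a sample from p_t(.|z)

# The conditional vector field evaluated at x collapses to the simple
# "velocity" alpha_dot * z + beta_dot * eps that appears in the loss above.
u_target = (alpha_d - beta_d / beta * alpha) * z + beta_d / beta * x
assert np.allclose(u_target, alpha_d * z + beta_d * eps)
```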

Straight Line Schedule

\[ \begin{align} L_{\text{CFM}}(\theta) &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)} \left[\|u_t^\theta(\alpha_t z + \beta_t \epsilon) - (\dot{\alpha}_t z + \dot{\beta}_t \epsilon)\|^2\right] \\ &= \mathbb{E}_{t \sim \text{Unif},\, z \sim p_{\text{data}},\, \epsilon \sim \mathcal{N}(0,I_d)} \left[\|\underbrace{u_t^\theta(tz + (1-t)\epsilon)}_{\substack{\color{#a30000}{\textbf{Linear interpolation}} \\ \color{#a30000}{\textbf{of noise and data}}}} - \underbrace{(z - \epsilon)}_{\substack{\color{#a30000}{\textbf{Difference between}} \\ \color{#a30000}{\textbf{noise and data}}}}\|^2\right] \end{align} \]

Figure credit: Yaron Lipman

Algorithm 4 Flow Matching Training for CondOT path


Require: A dataset of samples \(z \sim p_{\text{data}}\), neural network \(u_t^\theta\)

1: for each mini-batch of data do
2: \(\quad\) Sample a data example \(z\) from the dataset.
3: \(\quad\) Sample a random time \(t \sim \text{Unif}_{[0,1]}\).
4: \(\quad\) Sample noise \(\epsilon \sim \mathcal{N}(0, I_d)\)
5: \(\quad\) Set \(x = tz + (1-t)\epsilon\)
6: \(\quad\) Compute loss

\[\mathcal{L}(\theta) = \|u_t^\theta(x) - (z - \epsilon)\|^2\]

7: \(\quad\) Update the model parameters \(\theta\) via gradient descent on \(\mathcal{L}(\theta)\).
8: end for
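Algorithm 4 can be sketched end to end with a deliberately tiny setup. The affine model \(u_\theta(x,t) = w_0 x + w_1 t + w_2\) and the toy 1-D dataset are assumptions for illustration only (real models are deep networks), but the loop follows the algorithm's steps line by line:

```python
import numpy as np

rng = np.random.default_rng(0)
data = 2.0 + 0.5 * rng.standard_normal(1024)  # toy 1-D "dataset" of z ~ p_data

# Hypothetical tiny model u_theta(x, t) = w[0]*x + w[1]*t + w[2].
w = np.zeros(3)
lr = 0.01
losses = []

for step in range(2000):
    z = rng.choice(data)                   # 2: sample a data example
    t = rng.uniform()                      # 3: sample t ~ Unif[0, 1]
    eps = rng.standard_normal()            # 4: sample noise
    x = t * z + (1 - t) * eps              # 5: x = t z + (1 - t) eps
    phi = np.array([x, t, 1.0])            # model features
    residual = w @ phi - (z - eps)         # u_theta(x) - (z - eps)
    losses.append(residual**2)             # 6: squared CFM loss
    w -= lr * 2 * residual * phi           # 7: gradient step on the loss

# The average loss over the last steps drops well below the initial loss.
```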

Example Flow Matching — Stable Diffusion 3

The neural network that generates these images was trained with the algorithm just shown

Reminder: Sampling Algorithm for Flow Model

Algorithm 1 Sampling from a Flow Model with Euler method


Require: Neural network vector field \(u_t^\theta\), number of steps \(n\)

1: Set \(t = 0\)
2: Set step size \(h = \tfrac{1}{n}\)
3: Draw a sample \(X_0 \sim p_{\text{init}}\) (random initialization!)
4: for \(i = 1, \dots, n\) do
5: \(\quad X_{t+h} = X_t + h u_t^\theta(X_t)\)
6: \(\quad\) Update \(t \leftarrow t + h\)
7: end for
8: return \(X_1\) (the final point)
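As a sanity check of Algorithm 1, one can run the Euler sampler on a vector field known in closed form: for a point-mass data distribution \(p_{\text{data}} = \delta_z\) and the straight-line path, the marginal vector field coincides with the conditional one, \(u_t(x) = (z - x)/(1 - t)\). This is a toy assumption (the field is singular at \(t=1\)), chosen so every trajectory should land on \(z\):

```python
import numpy as np

def sample_flow(u, n, d, rng):
    """Algorithm 1: Euler-simulate dX_t = u(t, X_t) dt from X_0 ~ N(0, I_d)."""
    h = 1.0 / n
    x = rng.standard_normal(d)   # X_0 ~ p_init (random initialization)
    t = 0.0
    for _ in range(n):
        x = x + h * u(t, x)      # Euler step
        t += h
    return x                     # X_1

# Toy target: point mass at z, with vector field u_t(x) = (z - x) / (1 - t).
z = np.array([1.0, -2.0])
u = lambda t, x: (z - x) / (1.0 - t)

x1 = sample_flow(u, n=100, d=2, rng=np.random.default_rng(0))
```

With this particular field the final Euler step maps any point exactly onto \(z\), so `x1` matches `z` up to floating-point error.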

The Flow Matching Matrix

Conditional
Probability Path

\(\rightarrow\)

Conditional
Vector Field

\(\rightarrow\)

Conditional
Flow Matching Loss

Marginal
Probability Path

\(\rightarrow\)

Marginal
Vector Field

\(\rightarrow\)

Marginal
Flow Matching Loss

  • Probability paths: define the distributions interpolating from noise to data
  • Vector fields: define the training target that we want to learn
  • Flow matching losses: the loss functions that we minimize during training

Conditional Probability Path, Vector Field, and Flow Matching Loss

|  | Notation | Key property | Gaussian example |
| --- | --- | --- | --- |
| Conditional Probability Path | \(p_t(\cdot\|z)\) | Interpolates \(p_{\text{init}}\) and a data point \(z\) | \(\mathcal{N}(\alpha_t z,\, \beta_t^2 I_d)\) |
| Conditional Vector Field | \(u_t^{\text{target}}(x\|z)\) | ODE follows conditional path | \(\left(\dot{\alpha}_t - \dfrac{\dot{\beta}_t}{\beta_t}\alpha_t\right)z + \dfrac{\dot{\beta}_t}{\beta_t}x\) |
| Conditional FM Loss | \(L_{\text{CFM}}(\theta)\) | Loss we minimize during training | \(\mathbb{E}_{t,z,x}\left[\|u_t^\theta(x) - u_t^{\text{target}}(x\|z)\|^2\right]\) |

All these objects are tractable. Just analytical formulas!

Marginal Probability Path, Vector Field, and Flow Matching Loss

|  | Notation | Key property | Formula |
| --- | --- | --- | --- |
| Marginal Probability Path | \(p_t\) | Interpolates \(p_{\text{init}}\) and \(p_{\text{data}}\) | \(\int p_t(x\|z)\, p_{\text{data}}(z)\,\mathrm{d}z\) |
| Marginal Vector Field | \(u_t^{\text{target}}(x)\) | ODE follows marginal path | \(\int u_t^{\text{target}}(x\|z)\,\dfrac{p_t(x\|z)\,p_{\text{data}}(z)}{p_t(x)}\,\mathrm{d}z\) |
| Marginal FM Loss | \(L_{\text{FM}}(\theta)\) | Implicitly minimized via the conditional FM loss | \(\mathbb{E}_{t,z,x}\left[\|u_t^\theta(x) - u_t^{\text{target}}(x)\|^2\right]\) |

None of these objects are tractable. But we can still learn them!