$$ % Define your custom commands here \newcommand{\bmat}[1]{\begin{bmatrix}#1\end{bmatrix}} \newcommand{\E}{\mathbb{E}} \newcommand{\P}{\mathbb{P}} \newcommand{\S}{\mathbb{S}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[2]{\|{#1}\|_{{}_{#2}}} \newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\pdd}[2]{\frac{\partial^2 #1}{\partial #2^2}} \newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|} \newcommand{\abs}[1]{\left|{#1}\right|} \newcommand{\mbf}[1]{\mathbf{#1}} \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\bm}[1]{\boldsymbol{#1}} \newcommand{\nicefrac}[2]{{}^{#1}\!/_{\!#2}} \newcommand{\argmin}{\operatorname*{arg\,min}} \newcommand{\argmax}{\operatorname*{arg\,max}} $$

Variational Autoencoders

Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0] and
An Introduction to Variational Autoencoders by Kingma, D. P. and Welling, M.


Linear latent variable models

  • LVMs model a joint distribution \(p(\bm{x}, \bm{z})\) of the data \(\bm{x}\) and a latent variable \(\bm{z}\).
  • They then describe \(p(\bm{x})\) as the marginal distribution of the joint distribution: \[ p(\bm{x}) = \int p(\bm{x}, \bm{z}) d\bm{z} = \int p(\bm{x} | \bm{z}) p(\bm{z}) d\bm{z}. \]
  • This is a rather indirect approach to describing \(p(\bm{x})\).
    • Useful because expressions for \(p(\bm{x} | \bm{z})\) and \(p(\bm{z})\) are often much simpler than that for \(p(\bm{x})\).
Tip: Example: Gaussian Mixture Models
  • Let’s take a \(1\)D mixture of Gaussians
    • \(z\) is discrete: \(p(z)\) is a categorical distribution with probability \(\lambda_n\) for every possible value of \(z\).
    • The likelihood \(p(x \mid z = n)\) of the data is normally distributed with mean \(\mu_n\) and variance \(\sigma_n^2\). \[ \begin{aligned} p(z = n) &= \lambda_n \\ p(x \mid z = n) &= \mathcal{N}(x \mid \mu_n, \sigma_n^2). \end{aligned} \]
  • The data likelihood is given by marginalizing over the latent variable \(z\) \[ p(x) = \sum_{n=1}^N p(x, z=n) = \sum_{n=1}^N p(x \mid z=n) p(z=n) = \sum_{n=1}^N \lambda_n \mathcal{N}(x \mid \mu_n, \sigma_n^2). \]
  • Note that the likelihood and prior are both simple expressions, but the resulting data likelihood is multimodal!
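As a concrete illustration, the mixture density above can be evaluated and sampled in a few lines. The weights, means, and variances below are arbitrary illustrative values, not taken from the text:

```python
import numpy as np

# A toy 1-D mixture of three Gaussians (illustrative parameter values).
lam = np.array([0.5, 0.3, 0.2])      # mixture weights p(z = n) = lambda_n
mu = np.array([-2.0, 0.0, 3.0])      # component means mu_n
sigma2 = np.array([0.5, 1.0, 2.0])   # component variances sigma_n^2

def gaussian_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def p_x(x):
    # p(x) = sum_n lambda_n N(x | mu_n, sigma_n^2)
    return sum(l * gaussian_pdf(x, m, v) for l, m, v in zip(lam, mu, sigma2))

# Ancestral sampling: draw z from the categorical prior, then x | z.
rng = np.random.default_rng(0)
z = rng.choice(3, size=10_000, p=lam)
x = rng.normal(mu[z], np.sqrt(sigma2[z]))
```

Even though each component is unimodal, evaluating `p_x` on a grid shows the multimodal marginal from the figure below.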


Mixture of Gaussians (MoG).
a) The MoG describes a complex probability distribution (cyan curve) as a weighted sum of Gaussian components (dashed curves).
b) This sum is the marginalization of the joint density \(p(x, z)\) between the continuous observed data \(x\) and a discrete latent variable \(z\).

Nonlinear latent variable models

In a nonlinear latent variable model, both the data \(\bm{x}\) and the latent variable \(\bm{z}\) are continuous and multivariate.

  • Both the prior \(p(\bm{z})\) and the likelihood \(p(\bm{x} \mid \bm{z})\) are normally distributed.
    • \(p(\bm{z}) = \mathcal{N}(\bm{z} \mid \bm{0}, \bm{I})\)
    • \(p(\bm{x} \mid \bm{z}, \bm{\theta}) = \mathcal{N}(\bm{x} \mid \mu(\bm{z}), \Sigma(\bm{z}))\)
  • When the likelihood models a continuous variable, its mean and covariance are given by neural networks:
    • Often the covariance is fixed and assumed to be isotropic: \(\Sigma(\bm{z}) = \sigma^2 \bm{I}\) (\(\sigma\) can also be learned)
    • The mean is given by a neural network: \(\mu(\bm{z}) = f(\bm{z}; \bm{\theta})\)
  • The latent variable \(\bm{z}\) is lower dimensional than the data \(\bm{x}\).
  • In this example, the data probability \(p(\bm{x} \mid \bm{\theta})\) is found by marginalizing over the latent variable \(\bm{z}\) \[ p(\bm{x} \mid \bm{\theta}) = \int p(\bm{x} \mid \bm{z}, \bm{\theta}) p(\bm{z}) d\bm{z} = \int \mathcal{N}\left(\bm{x} \mid f(\bm{z}; \bm{\theta}), \sigma^2 \bm{I}\right) \mathcal{N}(\bm{z} \mid \bm{0}, \bm{I}) d\bm{z} \]


Nonlinear latent variable model. A complex \(2\)D density \(p(\bm{x})\) (right) is created as the marginalization of the joint distribution \(p(\bm{x}, z)\) (left) over the latent variable \(z\); to create \(p(\bm{x})\), we integrate the \(3\)D volume over the dimension \(z\). For each \(z\), the distribution over \(\bm{x}\) is a spherical Gaussian (two slices shown) with a mean \(f(\bm{z}; \bm{\theta})\) that is a nonlinear function of \(z\) and depends on parameters \(\theta\). The distribution \(p(\bm{x})\) is a weighted sum of these Gaussians.
  • This can be viewed as an infinite weighted sum (i.e., an infinite mixture)
    • Mixture is of spherical Gaussians with different means
    • The weights are \(p(\bm{z})\) and the means are \(f(\bm{z}; \bm{\theta})\)
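This infinite-mixture view can be made concrete with a Monte Carlo approximation of the marginal: average the spherical-Gaussian likelihood over prior draws of \(\bm{z}\). The toy decoder `f` below is an arbitrary nonlinearity standing in for a trained network, not something from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.1                    # fixed observation variance sigma^2

def f(z):
    # Toy "decoder": maps 1-D latent z to a 2-D mean, nonlinearly.
    return np.stack([np.sin(z), np.cos(z) * z], axis=-1)

def log_p_x(x, n_samples=50_000):
    # p(x) ~ (1/S) sum_s N(x | f(z_s), sigma^2 I),  z_s ~ N(0, 1)
    z = rng.standard_normal(n_samples)
    mean = f(z)                                   # shape (n_samples, 2)
    d2 = np.sum((x - mean) ** 2, axis=-1)
    # log N(x | f(z), sigma^2 I) for 2-D x
    log_lik = -0.5 * d2 / sigma2 - np.log(2 * np.pi * sigma2)
    # log of the average likelihood, computed stably (log-mean-exp)
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))
```

Points near the curved decoder manifold get high density; points far from it get very low density, which is exactly the weighted-sum-of-Gaussians picture.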

Generation

A new example \(\bm{x}^*\) is generated by ancestral sampling.

  • Draw \(\bm{z}^* \sim p(\bm{z})\)
  • Draw \(\bm{x}^* \sim p(\bm{x} \mid \bm{z}^*)\)
    • Pass \(\bm{z}^*\) through the decoder network \(f(\bm{z}^*; \bm{\theta})\) to compute the mean of \(p(\bm{x} \mid \bm{z}^*)\)
    • Draw \(\bm{x}^*\) from \(\mathcal{N}(\bm{x}^* \mid f(\bm{z}^*; \bm{\theta}), \sigma^2 \bm{I})\)
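The two-step ancestral sampling procedure can be sketched directly; `tanh` below is a stand-in for a trained decoder network \(f(\bm{z}; \bm{\theta})\), and the noise scale is an assumed value:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1                          # assumed fixed observation noise scale

def f(z):
    # Stand-in decoder mean; any smooth nonlinear map works for illustration.
    return np.tanh(z)

z_star = rng.standard_normal(2)      # step 1: z* ~ p(z) = N(0, I)
mean = f(z_star)                     # decoder computes the mean of p(x | z*)
x_star = rng.normal(mean, sigma)     # step 2: x* ~ N(f(z*; theta), sigma^2 I)
```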


Generation from nonlinear latent variable model. a) We draw a sample \(z^*\) from the prior probability \(p(z)\) over the latent variable. b) A sample \(\bm{x}^*\) is then drawn from \(p(\bm{x} \mid z^*, \bm{\theta})\). This is a spherical Gaussian with a mean that is a nonlinear function \(f(\cdot; \bm{\theta})\) of \(z^*\) and a fixed variance \(\sigma^2 \bm{I}\). c) If we repeat this process many times, we recover the density \(p(\bm{x} \mid \bm{\theta})\).

Training

  • We want to maximize the log-likelihood over a training dataset \(\{ \bm{x}_i \}_{i=1}^I\) with respect to \(\bm{\theta}\) \[ \bm{\theta}^* = \argmax_{\bm{\theta}} \sum_{i=1}^I \log p(\bm{x}_i \mid \bm{\theta}) = \argmax_{\bm{\theta}} \sum_{i=1}^I \log \int p(\bm{x}_i \mid \bm{z}, \bm{\theta}) p(\bm{z}) d\bm{z}, \tag{1}\] where for the Gaussian likelihood example, we would have \[ p(\bm{x}_i \mid \bm{\theta}) = \int \mathcal{N}\left(\bm{x}_i \mid f(\bm{z}; \bm{\theta}), \sigma^2 \bm{I}\right) \mathcal{N}(\bm{z} \mid \bm{0}, \bm{I}) d\bm{z}. \]
  • Unfortunately, this is intractable to compute directly.
    • No closed-form expression for the integral
    • No easy way to evaluate it for a particular value of \(\bm{x}\)
  • To make progress, we define a lower bound on the log-likelihood.
    • Always less than or equal to the log-likelihood for a given value of \(\bm{\theta}\)
    • Depends on some other parameters \(\bm{\phi}\)
    • We will build a network to compute this lower bound and optimize it.
Important: Jensen’s inequality

A concave function \(g(\cdot)\) of the expectation of data \(y\) is greater than or equal to the expectation of the function of the data: \[ g\left(\mathbb{E}[y]\right) \geq \mathbb{E}[g(y)]. \]

  • Using Jensen’s inequality with \(g = \log\), we obtain \[ \log \E(y) = \log \int p(y)y dy \geq \int p(y) \log{(y)} dy = \E \log{y}. \]

  • In fact, the slightly more general statement is true: \[ \log \int p(y)h(y)dy \geq \int p(y) \log{(h(y))}dy, \] for some function \(h(y)\) of \(y\).
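A quick numerical check of the inequality \(\log \E[y] \geq \E[\log y]\) for a positive random variable (the lognormal choice below is an arbitrary example):

```python
import numpy as np

# Jensen's inequality for the concave function log: log E[y] >= E[log y].
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # arbitrary positive y

lhs = np.log(np.mean(y))   # log E[y]
rhs = np.mean(np.log(y))   # E[log y]
assert lhs >= rhs          # the gap here is strictly positive
```

For this lognormal, the gap is about \(0.5\); it vanishes only when \(y\) is (almost surely) constant.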

  • The intractability of \(p(\bm{x} \mid \bm{\theta})\) is related to the intractability of the posterior distribution \(p(\bm{z} \mid \bm{x}, \bm{\theta})\).

    • Note that the joint distribution \(p(\bm{x}, \bm{z} \mid \bm{\theta})\) is efficient to compute and we have \[ p(\bm{z} \mid \bm{x}, \bm{\theta}) = \frac{p(\bm{x}, \bm{z} \mid \bm{\theta})}{p(\bm{x} \mid \bm{\theta})} \]
    • Since \(p(\bm{x}, \bm{z} \mid \bm{\theta})\) is tractable to compute, a tractable marginal likelihood \(p(\bm{x} \mid \bm{\theta})\) would imply a tractable posterior \(p(\bm{z} \mid \bm{x}, \bm{\theta})\), and vice versa. In fact, both are intractable!


Posterior distribution over the latent variable. a) The posterior distribution \(p(z \mid \bm{x}^*, \bm{\theta})\) is the distribution over the values of the latent variable \(z\) that could be responsible for a data point \(\bm{x}^*\). We calculate this via Bayes’s rule \(p(z \mid \bm{x}^*, \bm{\theta}) \propto p(\bm{x}^* \mid z, \bm{\theta}) p(z)\). b) We compute the first term on the right hand side (the likelihood) by assessing the probability of \(\bm{x}^*\) against the symmetric Gaussian associated with each value of \(z\). Here, it was more likely to have been created from \(z_1\) than \(z_2\). The second term is the prior probability \(p(z)\) over the latent variable. Combining these two factors and normalizing so the distribution sums to one gives us the posterior distribution \(p(z \mid \bm{x}^*, \bm{\theta})\).
  • Let us introduce a parametric inference model \(q(\bm{z} \mid \bm{x}, \bm{\phi})\).

    • Also called an encoder or recognition model.
    • With \(\bm{\phi}\), we indicate the variational parameters.
    • We optimize the variational parameters \(\bm{\phi}\) such that \[ q(\bm{z} \mid \bm{x}, \bm{\phi}) \approx p(\bm{z} \mid \bm{x}, \bm{\theta}). \]
  • For any choice of inference model \(q(\bm{z} \mid \bm{x}, \bm{\phi})\), including the choice of variational parameters \(\bm{\phi}\), we have \[ \begin{aligned} \log p(\bm{x} \mid \bm{\theta}) &= \E_{q(\bm{z} \mid \bm{x}, \bm{\phi})}[\log p(\bm{x} \mid \bm{\theta})] \\ &= \E_{q(\bm{z} \mid \bm{x}, \bm{\phi})}\left[ \log \left( \frac{p(\bm{x}, \bm{z} \mid \bm{\theta})}{p(\bm{z} \mid \bm{x}, \bm{\theta})} \right) \right] \\ &= \E_{q(\bm{z} \mid \bm{x}, \bm{\phi})}\left[ \log \left( \frac{p(\bm{x}, \bm{z} \mid \bm{\theta})}{q(\bm{z} \mid \bm{x}, \bm{\phi})} \frac{q(\bm{z} \mid \bm{x}, \bm{\phi})}{p(\bm{z} \mid \bm{x}, \bm{\theta})} \right) \right] \\ &= \underbrace{\E_{q(\bm{z} \mid \bm{x}, \bm{\phi})}\left[ \log \left( \frac{p(\bm{x}, \bm{z} \mid \bm{\theta})}{q(\bm{z} \mid \bm{x}, \bm{\phi})} \right) \right]}_{\substack{=\mc{L}_{\theta, \phi}(\bm{x}) \\ \text{(ELBO)}}} + \underbrace{\E_{q(\bm{z} \mid \bm{x}, \bm{\phi})}\left[ \log \left( \frac{q(\bm{z} \mid \bm{x}, \bm{\phi})}{p(\bm{z} \mid \bm{x}, \bm{\theta})} \right) \right]}_{=D_{\text{KL}}(q(\bm{z} \mid \bm{x}, \bm{\phi})\;\Vert\; p(\bm{z} \mid \bm{x}, \bm{\theta}))} \end{aligned} \tag{2}\]

  • The second term is the Kullback-Leibler (KL) divergence which is nonnegative: \[ D_{\text{KL}}(q(\bm{z} \mid \bm{x}, \bm{\phi})\;\Vert\; p(\bm{z} \mid \bm{x}, \bm{\theta})) \geq 0 \] and zero if and only if \(q(\bm{z} \mid \bm{x}, \bm{\phi}) = p(\bm{z} \mid \bm{x}, \bm{\theta})\).

  • The first term is the variational lower bound, also called the evidence lower bound (ELBO): \[ \mc{L}_{\theta, \phi}(\bm{x}) = \E_{q_\phi(\bm{z} \mid \bm{x})} \left[ \log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \right] \tag{3}\]

  • Due to the nonnegativity of the KL divergence, the ELBO is a lower bound on the log-likelihood of the data: \[ \mc{L}_{\theta, \phi}(\bm{x}) = \log p_\theta(\bm{x}) - D_{\text{KL}}\left(q_\phi(\bm{z} \mid \bm{x})\,\Vert\,p_\theta(\bm{z} \mid \bm{x})\right) \leq \log p_\theta(\bm{x}). \tag{4}\]

  • Hence, the KL divergence \(D_{\text{KL}}\left(q_\phi(\bm{z} \mid \bm{x})\,\Vert\,p_\theta(\bm{z} \mid \bm{x})\right)\) determines two “distances”:

    1. By definition, the KL divergence of the approximate posterior from the true posterior.
    2. The gap between the ELBO \(\mc{L}_{\theta, \phi}(\bm{x})\) and the marginal likelihood \(\log p_\theta(\bm{x})\).
    • The latter is also called the tightness of the bound.
    • The better \(q_\phi(\bm{z} \mid \bm{x})\) approximates the true (posterior) distribution \(p_\theta(\bm{z} \mid \bm{x})\), in terms of the KL divergence, the smaller the gap.
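Because both \(q_\phi\) and the prior are typically Gaussian in a VAE, KL terms like this one often have a closed form. A minimal 1-D sketch (the helper `kl_gauss` and its argument values are illustrative, not from the text):

```python
import numpy as np

# Closed-form KL divergence between 1-D Gaussians:
# D_KL( N(m1, s1^2) || N(m2, s2^2) )
#   = log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2,
# which is nonnegative and zero iff the two distributions coincide.
def kl_gauss(m1, s1, m2, s2):
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

assert kl_gauss(0.0, 1.0, 0.0, 1.0) == 0.0   # identical distributions
assert kl_gauss(1.0, 0.5, 0.0, 1.0) > 0.0    # any mismatch gives KL > 0
```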


Variational approximation. The posterior \(p(\bm{z} \mid \bm{x}^*, \bm{\theta})\) is intractable to compute. The variational approximation chooses a family of distributions \(q(\bm{z} \mid \bm{x}, \bm{\phi})\) (here Gaussians) and tries to find the closest member of this family to the true posterior. a) Sometimes, the approximation (cyan curve) is good and lies close to the true posterior (orange curve). b) However, if the posterior is multi-modal (as in the previous figure), then the Gaussian approximation will be poor.
Tip: Two for One

Equation 4 shows that maximizing the ELBO \(\mc{L}_{\theta, \phi}(\bm{x})\) with respect to \(\bm{\theta}\) and \(\bm{\phi}\) will concurrently optimize the two things we care about:

  1. It will approximately maximize the marginal likelihood \(p_\theta(\bm{x})\). This means that our generative model or decoder will become better.
  2. It will minimize the KL divergence of the approximation \(q_\phi(\bm{z} \mid \bm{x})\) from the true posterior \(p_\theta(\bm{z} \mid \bm{x})\), so \(q_\phi(\bm{z} \mid \bm{x})\) becomes better.
Caution: ELBO through Jensen’s inequality

We can also see that the ELBO is a lower bound on the log-likelihood by using Jensen’s inequality. \[ \begin{aligned} \log p(\bm{x} \mid \bm{\theta}) &= \log \int p(\bm{x}, \bm{z} \mid \bm{\theta}) d\bm{z} = \log \int p(\bm{x}, \bm{z} \mid \bm{\theta}) \frac{q(\bm{z} \mid \bm{x}, \bm{\phi})}{q(\bm{z} \mid \bm{x}, \bm{\phi})} d\bm{z} = \log \E_{q(\bm{z} \mid \bm{x}, \bm{\phi})} \left[ \frac{p(\bm{x}, \bm{z} \mid \bm{\theta})}{q(\bm{z} \mid \bm{x}, \bm{\phi})} \right] \\ &\geq \E_{q(\bm{z} \mid \bm{x}, \bm{\phi})} \left[ \log \frac{p(\bm{x}, \bm{z} \mid \bm{\theta})}{q(\bm{z} \mid \bm{x}, \bm{\phi})} \right] = \mc{L}_{\theta, \phi}(\bm{x}) \end{aligned} \]


Evidence lower bound (ELBO). The goal is to maximize the log-likelihood \(\log p_\theta(\bm{x})\) (black curve) with respect to the parameters \(\bm{\theta}\). The ELBO is a function that lies everywhere below the log-likelihood. It is a function of both \(\bm{\theta}\) and a second set of parameters \(\bm{\phi}\). For fixed \(\bm{\phi}\), we get a function of \(\bm{\theta}\) (two colored curves for different values of \(\bm{\phi}\)). Consequently, we can increase the log-likelihood by either improving the ELBO with respect to a) the new parameters \(\bm{\phi}\) (moving from colored curve to colored curve) or b) the original parameters \(\bm{\theta}\) (moving along the current colored curve).
Warning: Two Models to Learn
  1. Encoder or Recognition or Inference Model \[ \begin{aligned} (\bm{\mu}, \log \bm{\sigma}) &= \operatorname{EncoderNeuralNet}_\phi(\bm{x}) \\ q_\phi(\bm{z} \mid \bm{x}) &= \mc{N}(\bm{z}; \bm{\mu}, \operatorname{diag}{(\bm{\sigma})}). \end{aligned} \]

  2. Decoder or Generative Model (Binary Data) \[ \begin{aligned} \bm{p} &= \operatorname{DecoderNeuralNet}_\theta(\bm{z}) \\ \log p_\theta(\bm{x} \mid \bm{z}) &= \sum_{j=1}^D \log p(x_j \mid \bm{z}) = \sum_{j=1}^D \log \operatorname{Bernoulli}(x_j; p_j) \\ &= \sum_{j=1}^D \left[ x_j \log p_j + (1-x_j) \log(1-p_j) \right] \end{aligned} \] where \(0 \leq p_j \leq 1\) for all \(p_j \in \bm{p}\) (e.g., implemented through a sigmoid nonlinearity as the last layer of \(\operatorname{DecoderNeuralNet}_\theta(\cdot)\)), \(D\) is the dimensionality of \(\bm{x}\), and \(\operatorname{Bernoulli}(\cdot; p)\) is the probability mass function of the Bernoulli distribution.
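These two models can be sketched with single linear layers standing in for the neural networks (toy sizes and random weights; real encoders and decoders are deep networks):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 2                                    # toy data / latent dimensions

# Single-layer stand-ins for EncoderNeuralNet_phi and DecoderNeuralNet_theta.
W_enc = rng.standard_normal((2 * K, D)) * 0.1  # outputs (mu, log sigma)
W_dec = rng.standard_normal((D, K)) * 0.1

def encode(x):
    # q_phi(z | x) = N(z; mu, diag(sigma)): predict mu and log sigma.
    h = W_enc @ x
    mu, log_sigma = h[:K], h[K:]
    return mu, np.exp(log_sigma)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def decode_log_lik(x, z):
    # Bernoulli decoder: log p_theta(x | z)
    #   = sum_j x_j log p_j + (1 - x_j) log(1 - p_j)
    p = sigmoid(W_dec @ z)                     # sigmoid keeps 0 < p_j < 1
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
```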

Note: Summary

Since the log-likelihood is intractable to compute, we cannot perform the optimization in Equation 1. Instead, we maximize the ELBO \(\mc{L}_{\theta, \phi}(\bm{x})\) with respect to both the decoder and encoder parameters \((\bm{\theta}, \bm{\phi})\) as a proxy.


Variational autoencoder. The encoder \(g(\bm{x}; \bm{\phi})\) takes a training example \(\bm{x}\) and predicts the parameters \(\bm{\mu}, \bm{\Sigma}\) of the variational distribution \(q_\phi(\bm{z} \mid \bm{x})\). We sample from this distribution and then use the decoder \(f(\bm{z}; \bm{\theta})\) to predict the data \(\bm{x}\). The loss function is the negative ELBO, which depends on how accurate this prediction is and how similar the variational distribution \(q_\phi(\bm{z} \mid \bm{x})\) is to the prior \(p(\bm{z})\).

How to maximize ELBO?

  • Unbiased gradients of the ELBO with respect to the generative model parameters \(\bm{\theta}\) are simple to obtain: \[ \begin{aligned} \nabla_\theta \mc{L}_{\theta, \phi}(\bm{x}) &= \nabla_\theta \E_{q_\phi(\bm{z} \mid \bm{x})} \left[\log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \right] \\ &= \E_{q_\phi(\bm{z} \mid \bm{x})} \left[\nabla_\theta\left( \log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \right) \right] \\ &\simeq \nabla_\theta\left( \log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \right) \\ &= \nabla_\theta \log p_\theta(\bm{x}, \bm{z}). \end{aligned} \] The last line is a simple Monte Carlo estimator of the second line, where \(\bm{z}\) in the last two lines is a random sample from \(q_\phi(\bm{z} \mid \bm{x})\).

  • Unbiased gradients with respect to the variational parameters \(\bm{\phi}\) are more difficult to obtain.

    • ELBO’s expectation is taken with respect to the distribution \(q_\phi(\bm{z} \mid \bm{x})\).
    • This is a function of \(\bm{\phi}\)!

\[ \begin{aligned} \nabla_\phi \mc{L}_{\theta, \phi}(\bm{x}) &= \nabla_\phi \E_{q_\phi(\bm{z} \mid \bm{x})} \left[\log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \right] \\ &\neq \E_{q_\phi(\bm{z} \mid \bm{x})} \left[\nabla_\phi \left( \log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \right) \right] \end{aligned} \]

Important: The reparametrization trick (law of the unconscious statistician, LOTUS)

For continuous latent variables and a differentiable encoder and generative model, the ELBO can be straightforwardly differentiated with respect to both \(\bm{\phi}\) and \(\bm{\theta}\) through a change of variables, also called the reparametrization trick.

  • Express the random variable \(\bm{z} \sim q_\phi(\bm{z} \mid \bm{x})\) as some differentiable (and invertible) transformation.
    • This is a function of another random variable \(\bm{\varepsilon}\), given \(\bm{x}\) and \(\bm{\phi}\): \[ \bm{z} = \bm{h}(\bm{\varepsilon}, \bm{\phi}, \bm{x}) \]
    • The random variable \(\bm{\varepsilon}\) is independent of \(\bm{x}\) or \(\bm{\phi}\).
  • Now, the expectations can be rewritten in terms of \(\bm{\varepsilon}\): \[ \E_{q_\phi(\bm{z} \mid \bm{x})}[f(\bm{z})] = \E_{p(\bm{\varepsilon})}[f(\bm{z})]. \]
  • This makes the expectation and gradient operators commutative, and we can form a simple Monte Carlo estimator: \[ \begin{aligned} \nabla_\phi \E_{q_\phi(\bm{z} \mid \bm{x})}[f(\bm{z})] &= \nabla_\phi \E_{p(\bm{\varepsilon})}[f(\bm{z})] \\ &= \E_{p(\bm{\varepsilon})}[\nabla_\phi f(\bm{z})] \\ &\simeq \nabla_\phi f(\bm{z}) \end{aligned} \] where in the last line, \(\bm{z} = h(\bm{\varepsilon}, \bm{\phi}, \bm{x})\) with random noise sample \(\bm{\varepsilon} \sim p(\bm{\varepsilon})\).
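For the diagonal-Gaussian encoder used above, the transformation is simply \(\bm{z} = \bm{\mu} + \bm{\sigma} \odot \bm{\varepsilon}\). A minimal sketch with assumed values for \(\bm{\mu}\) and \(\bm{\sigma}\):

```python
import numpy as np

# Reparametrization for q_phi(z | x) = N(mu, diag(sigma^2)):
# draw eps ~ N(0, I), independent of phi and x, and set
# z = h(eps, phi, x) = mu + sigma * eps, so gradients flow through mu, sigma.
rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])           # illustrative encoder outputs
sigma = np.array([1.0, 0.3])

eps = rng.standard_normal((100_000, 2))   # externalized randomness
z = mu + sigma * eps                      # samples from q_phi(z | x)
```

The sample mean and standard deviation of `z` match `mu` and `sigma`, confirming that the deterministic map reproduces the intended distribution.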


Reparametrization trick. The variational parameters \(\bm{\phi}\) affect the objective \(f\) through the random variable \(\bm{z} \sim q_\phi(\bm{z} \mid \bm{x})\). We wish to compute gradients \(\nabla_\phi f\) to optimize the objective with SGD. In the original form (left), we cannot differentiate \(f\) w.r.t. \(\bm{\phi}\), because we cannot directly backpropagate gradients through the random variable \(\bm{z}\). We can “externalize” the randomness in \(\bm{z}\) by re-parametrizing the variable as a deterministic and differentiable function of \(\bm{\phi}\), \(\bm{x}\), and a newly introduced random variable \(\bm{\varepsilon}\) (right). This allows us to “backprop through \(\bm{z}\),” and compute gradients \(\nabla_\phi f\).
Note: Gradient of ELBO
  • Under the reparametrization, we can replace an expectation with respect to \(q_\phi(\bm{z} \mid \bm{x})\) with one with respect to \(p(\bm{\varepsilon})\).

    • Now, the ELBO can be written as \[ \begin{aligned} \mc{L}_{\theta, \phi}(\bm{x}) &= \E_{q_\phi(\bm{z} \mid \bm{x})} \left[ \log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \right] \\ &= \E_{p(\bm{\varepsilon})} \left[ \log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \right], \end{aligned} \] where \(\bm{z} = \bm{h}(\bm{\varepsilon}, \bm{\phi}, \bm{x})\).
  • As a result, we can form a simple Monte Carlo estimator \(\tilde{\mc{L}}_{\theta, \phi}(\bm{x})\): \[ \begin{aligned} \bm{\varepsilon} &\sim p(\bm{\varepsilon}) \\ \bm{z} &= \bm{h}(\bm{\varepsilon}, \bm{\phi}, \bm{x}) \\ \tilde{\mc{L}}_{\theta, \phi}(\bm{x}) &= \log p_\theta(\bm{x}, \bm{z}) - \log q_\phi(\bm{z} \mid \bm{x}) \end{aligned} \]

  • The resulting gradient \(\nabla_\phi \tilde{\mc{L}}_{\theta, \phi}(\bm{x})\) is used to optimize the ELBO using minibatch SGD.
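Putting the pieces together, the single-sample estimator \(\tilde{\mc{L}}_{\theta, \phi}(\bm{x})\) can be sketched for a fully Gaussian 1-D toy model; the linear "decoder" \(az\) and all parameter values are illustrative assumptions:

```python
import numpy as np

# Toy model: p(z) = N(0, 1),  p(x | z) = N(a z, s2),  q(z | x) = N(mu, sig^2).
rng = np.random.default_rng(0)
a, s2 = 2.0, 0.5          # assumed "decoder" slope and observation variance
mu, sig = 0.3, 0.8        # assumed variational parameters for this x
x = 1.0                   # a single observed data point

def log_normal(v, mean, var):
    return -0.5 * ((v - mean) ** 2 / var + np.log(2 * np.pi * var))

eps = rng.standard_normal()                 # eps ~ p(eps) = N(0, 1)
z = mu + sig * eps                          # z = h(eps, phi, x)
log_joint = log_normal(x, a * z, s2) + log_normal(z, 0.0, 1.0)  # log p(x, z)
log_q = log_normal(z, mu, sig**2)           # log q(z | x)
elbo_hat = log_joint - log_q                # unbiased one-sample ELBO estimate
```

Averaging `elbo_hat` over many `eps` draws approaches the true ELBO, which stays below the exact \(\log p(x)\) (available in closed form for this linear-Gaussian toy model).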


Reparametrization trick. With the original architecture, we cannot easily backpropagate through the sampling step. The reparametrization trick removes the sampling step from the main pipeline; we draw from a standard normal and combine this with the predicted mean and covariance to get a sample from the variational distribution.


The VAE updates both factors that determine the lower bound at each iteration. Both the parameters \(\bm{\theta}\) of the decoder and the parameters \(\bm{\phi}\) of the encoder are manipulated to increase this lower bound.