Unsupervised Learning
Note: Unsupervised Learning Models
- They are learned from a set of observed data \(\{\bm{x}_i\}_{i=1}^N\) in the absence of labels.
- All unsupervised models share this property, but they have diverse goals.
- Density estimation
- Feature learning
- Dimensionality reduction
- Clustering
- Generation
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
- A common strategy in unsupervised learning is to define a mapping between the data examples \(\bm{x}\) and a set of unseen latent variables \(\bm{z}\).
- These latents capture underlying structure in the dataset and usually have a lower dimension than the original data.
- A latent variable \(\bm{z}\) can be considered a compressed version of the data example \(\bm{x}\) that captures its essential qualities.
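As a concrete (and deliberately simple) illustration of this idea, the sketch below uses PCA — not one of the models covered later, just the most elementary latent-variable mapping — to recover a 2-D latent code \(\bm{z}\) from 5-D data that secretly lies near a 2-D subspace. All names and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 200 points in R^5 that actually live near a 2-D subspace.
z_true = rng.normal(size=(200, 2))           # hidden 2-D structure
W = rng.normal(size=(2, 5))                  # mixing into 5 dimensions
x = z_true @ W + 0.05 * rng.normal(size=(200, 5))

# PCA: project onto the top-2 principal directions to obtain latents z.
x_centered = x - x.mean(axis=0)
_, _, Vt = np.linalg.svd(x_centered, full_matrices=False)
z = x_centered @ Vt[:2].T                    # latent codes, shape (200, 2)

# Reconstruct x from z: a "decompressed" approximation of each example.
x_hat = z @ Vt[:2] + x.mean(axis=0)
err = np.mean((x - x_hat) ** 2)              # small: z kept the essentials
```

The low reconstruction error shows that the 2-D code \(\bm{z}\) captures the essential structure of each 5-D example, which is exactly the "compressed version" intuition above.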
- Normalizing flows, variational autoencoders, and diffusion models are probabilistic generative models.
- In addition to generating new examples, they
- assign a probability \(p(\bm{x} \mid \bm{\theta})\) to each data point \(\bm{x}\).
- The dependence on model parameters \(\bm{\theta}\) implies that we can try to maximize the log-likelihood of observed data \(\{\bm{x}_i\}_{i=1}^N\): \[ \bm{\theta}^* = \argmax_{\bm{\theta}} \sum_{i=1}^N \log p(\bm{x}_i \mid \bm{\theta}). \]
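A toy instance of this objective, assuming a one-dimensional Gaussian model \(p(x \mid \bm{\theta})\) with \(\bm{\theta} = (\mu, \sigma)\) (the simplest case, where the maximizer has a closed form):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # observed examples x_i

def log_likelihood(data, mu, sigma):
    # sum_i log N(x_i | mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

# For a Gaussian, the maximizing theta* is the sample mean and std.
mu_star, sigma_star = data.mean(), data.std()

# Any other parameter setting attains a lower log-likelihood.
assert log_likelihood(data, mu_star, sigma_star) \
    >= log_likelihood(data, 0.0, 1.0)
```

For the deep generative models discussed in this course the maximizer has no closed form, so \(\bm{\theta}^*\) is instead approached by gradient ascent on the same sum of log-probabilities.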
- Since probability distributions must integrate (or sum) to one, raising the probability of the observed data implicitly reduces the probability of examples that lie far from it.
- As well as providing a training criterion, assigning probabilities is useful in its own right:
- the probability on a test set can be used to compare two models quantitatively.
- the probability of an example can be thresholded to determine if it belongs to the same dataset or is an outlier.
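The thresholding idea can be sketched as follows, again with a Gaussian standing in for the trained probabilistic model (the model choice and the 1st-percentile threshold are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(0.0, 1.0, size=5000)   # training data

# Fit the density model (here: Gaussian MLE as a stand-in).
mu, sigma = train.mean(), train.std()

def log_prob(x):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Threshold at, say, the 1st percentile of training log-probabilities.
threshold = np.percentile(log_prob(train), 1.0)

def is_outlier(x):
    return log_prob(x) < threshold

assert not is_outlier(0.1)   # typical point: same dataset
assert is_outlier(8.0)       # far from the data: flagged as outlier
```

The same recipe applies unchanged to any model that can evaluate \(\log p(\bm{x} \mid \bm{\theta})\), e.g. a normalizing flow.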
- Generative adversarial networks (GANs) are also generative models, but they do not assign probabilities to data examples.
- We will not talk about these in this course.
Tip: What makes a good generative model?
- Efficient sampling: Drawing samples from the model should be computationally inexpensive and take advantage of the parallelism of modern hardware.
- High-quality sampling: The samples should be indistinguishable from the real data with which the model was trained.
- Coverage: Samples should represent the entire training distribution. It is insufficient to generate samples that all look like a subset of the training examples.
- Well-behaved latent space: Every latent variable \(\bm{z}\) corresponds to a plausible data example \(\bm{x}\). Smooth changes in \(\bm{z}\) correspond to smooth changes in \(\bm{x}\).
- Disentangled latent space: Manipulating each dimension of \(\bm{z}\) should correspond to changing an interpretable property of the data. For example, in a model of language, it might change the topic, tense, or verbosity.
- Efficient likelihood computation: If the model is probabilistic, we would like to be able to calculate the probability of new examples efficiently and accurately.
| Model | Efficient sampling | Sample quality | Coverage | Well-behaved latent space | Disentangled latent space | Efficient likelihood |
|---|---|---|---|---|---|---|
| GANs | \(\checkmark\) | \(\checkmark\) | ✗ | \(\checkmark\) | ? | n/a |
| VAEs | \(\checkmark\) | ✗ | ? | \(\checkmark\) | ? | ✗ |
| Flows | \(\checkmark\) | ✗ | ? | \(\checkmark\) | ? | \(\checkmark\) |
| Diffusion | ✗ | \(\checkmark\) | ? | ✗ | ✗ | ✗ |