Week 3: Fundamentals of Deep Neural Networks
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
These are networks with more than one hidden layer, making intuitive understanding more difficult.
Composing neural networks
Network 1:
\[ \begin{split} h_1 &= \sigma(\theta_{10} + \theta_{11}x) \\ h_2 &= \sigma(\theta_{20} + \theta_{21}x) \\ h_3 &= \sigma(\theta_{30} + \theta_{31}x) \end{split} \]
\[ y = \phi_0 + \phi_1h_1 + \phi_2h_2 + \phi_3h_3 \]
Network 2:
\[ \begin{split} h_1^\prime &= \sigma(\theta_{10}^\prime + \theta_{11}^\prime y) \\ h_2^\prime &= \sigma(\theta_{20}^\prime + \theta_{21}^\prime y) \\ h_3^\prime &= \sigma(\theta_{30}^\prime + \theta_{31}^\prime y) \end{split} \]
\[ y^\prime = \phi_0^\prime + \phi_1^\prime h_1^\prime + \phi_2^\prime h_2^\prime + \phi_3^\prime h_3^\prime \]
Combining the two networks into one
Network 1:
\[ \begin{split} h_1 &= \sigma(\theta_{10} + \theta_{11}x) \\ h_2 &= \sigma(\theta_{20} + \theta_{21}x) \\ h_3 &= \sigma(\theta_{30} + \theta_{31}x) \end{split} \]
\[ y = \phi_0 + \phi_1h_1 + \phi_2h_2 + \phi_3h_3 \]
Network 2:
\[ \begin{split} h_1^\prime &= \sigma(\theta_{10}^\prime + \theta_{11}^\prime y) \\ h_2^\prime &= \sigma(\theta_{20}^\prime + \theta_{21}^\prime y) \\ h_3^\prime &= \sigma(\theta_{30}^\prime + \theta_{31}^\prime y) \end{split} \]
\[ y^\prime = \phi_0^\prime + \phi_1^\prime h_1^\prime + \phi_2^\prime h_2^\prime + \phi_3^\prime h_3^\prime \]
Hidden units of the second network in terms of the first: \[ \begin{split} h_1^\prime &= \sigma(\theta_{10}^\prime + \theta_{11}^\prime y) = \sigma\left(\theta_{10}^\prime + \theta_{11}^\prime \phi_0 + \theta_{11}^\prime \phi_1h_1 + \theta_{11}^\prime\phi_2h_2 + \theta_{11}^\prime\phi_3h_3 \right) \\ h_2^\prime &= \sigma(\theta_{20}^\prime + \theta_{21}^\prime y) = \sigma\left(\theta_{20}^\prime + \theta_{21}^\prime \phi_0 + \theta_{21}^\prime \phi_1h_1 + \theta_{21}^\prime\phi_2h_2 + \theta_{21}^\prime\phi_3h_3 \right) \\ h_3^\prime &= \sigma(\theta_{30}^\prime + \theta_{31}^\prime y) = \sigma\left(\theta_{30}^\prime + \theta_{31}^\prime \phi_0 + \theta_{31}^\prime \phi_1h_1 + \theta_{31}^\prime\phi_2h_2 + \theta_{31}^\prime\phi_3h_3 \right) \end{split} \]
which we can rewrite as
\[ \begin{split} h_1^\prime &= \sigma\left(\psi_{10} + \psi_{11}h_1 + \psi_{12}h_2 + \psi_{13}h_3 \right) \\ h_2^\prime &= \sigma\left(\psi_{20} + \psi_{21}h_1 + \psi_{22}h_2 + \psi_{23}h_3 \right) \\ h_3^\prime &= \sigma\left(\psi_{30} + \psi_{31}h_1 + \psi_{32}h_2 + \psi_{33}h_3 \right) \end{split} \]
Two-layer network
\[ \begin{split} h_1 &= \sigma(\theta_{10} + \theta_{11}x) \\ h_2 &= \sigma(\theta_{20} + \theta_{21}x) \\ h_3 &= \sigma(\theta_{30} + \theta_{31}x) \end{split} \tag{1}\]
\[ \begin{split} h_1^\prime &= \sigma\left(\psi_{10} + \psi_{11}h_1 + \psi_{12}h_2 + \psi_{13}h_3 \right) \\ h_2^\prime &= \sigma\left(\psi_{20} + \psi_{21}h_1 + \psi_{22}h_2 + \psi_{23}h_3 \right) \\ h_3^\prime &= \sigma\left(\psi_{30} + \psi_{31}h_1 + \psi_{32}h_2 + \psi_{33}h_3 \right) \end{split} \tag{2}\]
\[ y^\prime = \phi_0^\prime + \phi_1^\prime h_1^\prime + \phi_2^\prime h_2^\prime + \phi_3^\prime h_3^\prime \tag{3}\]
- The three hidden units \(h_1\), \(h_2\), and \(h_3\) in the first layer are computed as usual by forming linear functions of the input and passing these through ReLU activation functions (Equation 1).
- The pre-activations at the second layer are computed by taking three new linear functions of these hidden units (arguments of the activation functions in Equation 2). At this point, we effectively have a shallow network with three outputs; we have computed three piecewise linear functions with the “joints” between linear regions in the same places.
- At the second hidden layer, another ReLU function \(\sigma(\cdot)\) is applied to each function (Equation 2), which clips them and adds new “joints” to each.
- The final output is a linear combination of these hidden units (Equation 3).
- Regardless, it’s important not to lose sight of the fact that this is still merely an equation relating input \(x\) to output \(y^\prime\). Indeed, we can combine Equations 1–3 into a single expression:
\[ \begin{split} y^\prime &= \phi_0^\prime + \phi_1^\prime \sigma\left( \psi_{10} + \psi_{11}\sigma(\theta_{10} + \theta_{11}x) + \psi_{12}\sigma(\theta_{20} + \theta_{21}x) + \psi_{13}\sigma(\theta_{30} + \theta_{31}x) \right) \\ &\quad + \, \phi_2^\prime \sigma \left( \psi_{20} + \psi_{21}\sigma(\theta_{10} + \theta_{11}x) + \psi_{22}\sigma(\theta_{20} + \theta_{21}x) + \psi_{23}\sigma(\theta_{30} + \theta_{31}x) \right) \\ &\quad + \, \phi_3^\prime \sigma \left( \psi_{30} + \psi_{31}\sigma(\theta_{10} + \theta_{11}x) + \psi_{32}\sigma(\theta_{20} + \theta_{21}x) + \psi_{33}\sigma(\theta_{30} + \theta_{31}x) \right). \end{split} \]
although this is admittedly rather difficult to understand.
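To make the composition concrete, here is a minimal NumPy sketch of this two-layer network, taking \(\sigma\) to be the ReLU as above; all parameter values are made-up illustrative numbers.

```python
# A minimal sketch of the two-layer network in Equations 1-3.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Layer 1: theta_0 holds the biases theta_{.0}, theta_1 the slopes theta_{.1}
theta_0 = np.array([0.3, -1.0, 0.5])
theta_1 = np.array([-1.0, 2.0, 0.7])

# Layer 2: psi_0 holds the biases psi_{.0}, Psi is the 3x3 weight matrix psi_{jk}
psi_0 = np.array([0.1, -0.4, 0.9])
Psi = np.array([[ 1.0, -0.5,  0.2],
                [-0.3,  0.8,  1.1],
                [ 0.6,  0.4, -0.9]])

# Output layer: phi_0p is the bias phi'_0, phi_p the weights phi'_1..phi'_3
phi_0p = -0.2
phi_p = np.array([0.5, -1.2, 0.8])

def two_layer_net(x):
    h = relu(theta_0 + theta_1 * x)   # Equation 1
    h_prime = relu(psi_0 + Psi @ h)   # Equation 2
    return phi_0p + phi_p @ h_prime   # Equation 3

print(two_layer_net(0.4))  # scalar output y'
```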
Hyperparameters are parameters that are chosen before training the network, such as
- \(K\) layers: depth of the network
- \(D_k\) hidden units/neurons per layer: width of the network
Typically, different hyperparameter settings are probed by retraining the network: this is hyperparameter optimization.
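As a toy illustration, the sketch below grid-searches depth and width; `train_and_evaluate` is a hypothetical stand-in that would, in practice, train a network of the given size and return its validation loss.

```python
# A minimal grid-search sketch over the hyperparameters K (depth) and D (width).
import itertools

def train_and_evaluate(depth, width):
    # Placeholder: a real implementation would train a network of this
    # depth/width and return its validation loss.
    return abs(depth - 4) + abs(width - 32) / 16.0

depths = [2, 4, 8]
widths = [16, 32, 64]
best = min(itertools.product(depths, widths),
           key=lambda kw: train_and_evaluate(*kw))
print("best (K, D):", best)
```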
Matrix notation
\[ \begin{aligned} \begin{split} h_1 &= \sigma(\theta_{10} + \theta_{11}x) \\ h_2 &= \sigma(\theta_{20} + \theta_{21}x) \\ h_3 &= \sigma(\theta_{30} + \theta_{31}x) \end{split} &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad \bmat{ h_1 \\ h_2 \\ h_3 } = \bm{\sigma}\left( \bmat{ \theta_{10} \\ \theta_{20} \\ \theta_{30} } + \bmat{ \theta_{11} \\ \theta_{21} \\ \theta_{31} } x \right) \end{aligned} \]
\[ \begin{aligned} \begin{split} h_1^\prime &= \sigma\left(\psi_{10} + \psi_{11}h_1 + \psi_{12}h_2 + \psi_{13}h_3 \right) \\ h_2^\prime &= \sigma\left(\psi_{20} + \psi_{21}h_1 + \psi_{22}h_2 + \psi_{23}h_3 \right) \\ h_3^\prime &= \sigma\left(\psi_{30} + \psi_{31}h_1 + \psi_{32}h_2 + \psi_{33}h_3 \right) \end{split} &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad \bmat{ h_1^\prime \\ h_2^\prime \\ h_3^\prime } = \bm{\sigma}\left( \bmat{ \psi_{10} \\ \psi_{20} \\ \psi_{30} } + \bmat{ \psi_{11} & \psi_{12} & \psi_{13} \\ \psi_{21} & \psi_{22} & \psi_{23} \\ \psi_{31} & \psi_{32} & \psi_{33} } \bmat{ h_1 \\ h_2 \\ h_3 } \right) \end{aligned} \]
\[ \begin{aligned} y^\prime = \phi_0^\prime + \phi_1^\prime h_1^\prime + \phi_2^\prime h_2^\prime + \phi_3^\prime h_3^\prime &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad y^\prime = \phi_0^\prime + \bmat{\phi_1^\prime & \phi_2^\prime & \phi_3^\prime}\bmat{h_1^\prime \\ h_2^\prime \\ h_3^\prime} \end{aligned} \]
\[ \begin{aligned} \begin{split} \bmat{ h_1 \\ h_2 \\ h_3 } = \bm{\sigma}\left( \bmat{ \theta_{10} \\ \theta_{20} \\ \theta_{30} } + \bmat{ \theta_{11} \\ \theta_{21} \\ \theta_{31} } x \right) \end{split} &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad \bm{h} = \bm{\sigma}\left( \bm{\theta}_0 + \bm{\theta}x \right) \end{aligned} \]
\[ \begin{aligned} \begin{split} \bmat{ h_1^\prime \\ h_2^\prime \\ h_3^\prime } = \bm{\sigma}\left( \bmat{ \psi_{10} \\ \psi_{20} \\ \psi_{30} } + \bmat{ \psi_{11} & \psi_{12} & \psi_{13} \\ \psi_{21} & \psi_{22} & \psi_{23} \\ \psi_{31} & \psi_{32} & \psi_{33} } \bmat{ h_1 \\ h_2 \\ h_3 } \right) \end{split} &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad \bm{h}^\prime = \bm{\sigma}\left( \bm{\psi}_0 + \bm{\Psi}\bm{h} \right) \end{aligned} \]
\[ \begin{aligned} y^\prime = \phi_0^\prime + \bmat{\phi_1^\prime & \phi_2^\prime & \phi_3^\prime}\bmat{h_1^\prime \\ h_2^\prime \\ h_3^\prime} &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad y^\prime = \phi_0^\prime + \bm{\phi}^\prime \bm{h}^\prime \end{aligned} \]
\[ \begin{aligned} \bm{h} = \bm{\sigma}\left( \bm{\theta}_0 + \bm{\theta}x \right) &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad \bm{h}_1 = \sigma\left(\bm{\beta}_0 + \bm{\Omega}_0 \bm{x} \right) \end{aligned} \]
\[ \begin{aligned} \bm{h}^\prime = \bm{\sigma}\left( \bm{\psi}_0 + \bm{\Psi}\bm{h} \right) &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad \bm{h}_2 = \sigma\left( \bm{\beta}_1 + \bm{\Omega}_1 \bm{h}_1 \right) \end{aligned} \]
\[ \begin{aligned} y^\prime = \phi_0^\prime + \bm{\phi}^\prime \bm{h}^\prime &\quad {\color{DodgerBlue} \class{thick-arrow}{\xrightarrow{\hspace{1cm}}}} \quad \bm{y} = \bm{\beta}_2 + \bm{\Omega}_2\bm{h}_2 \end{aligned} \]
Let \(\bm{h}_0 = \bm{x}\), where \(\bm{x}\) is the input to the neural network. \[ \bm{h}_{k+1} = \sigma\left(\bm{\beta}_{k} + \bm{\Omega}_{k}\bm{h}_k\right), \;\; k = 0, \ldots, K-1; \quad \bm{y} = \bm{\beta}_K + \bm{\Omega}_K\bm{h}_K. \]
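A minimal NumPy sketch of this general recurrence, assuming ReLU activations as elsewhere in these notes; the layer sizes, random seed, and parameter values are illustrative only.

```python
# Forward pass of a K-layer network: one matrix multiply and one
# elementwise ReLU per hidden layer, then an affine output layer.
import numpy as np

rng = np.random.default_rng(0)

def forward(x, betas, omegas):
    """betas/omegas hold beta_0..beta_K and Omega_0..Omega_K."""
    h = x                                      # h_0 = x
    for beta, omega in zip(betas[:-1], omegas[:-1]):
        h = np.maximum(0.0, beta + omega @ h)  # h_{k+1} = sigma(beta_k + Omega_k h_k)
    return betas[-1] + omegas[-1] @ h          # y = beta_K + Omega_K h_K

# Example: input dim 2, two hidden layers of width 4, output dim 1
dims = [2, 4, 4, 1]
betas = [rng.normal(size=d_out) for d_out in dims[1:]]
omegas = [rng.normal(size=(d_out, d_in)) for d_in, d_out in zip(dims[:-1], dims[1:])]

print(forward(np.array([0.5, -1.0]), betas, omegas))
```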
Shallow vs. deep neural networks
The best results are achieved by deep neural networks with many layers.
- 50-1000 layers for many applications
- Best results in \[ \begin{rcases} \text{• Computer vision} \\ \text{• Natural language processing} \\ \text{• Graph neural networks} \\ \text{• Generative models} \\ \text{• Reinforcement Learning} \end{rcases} \quad \begin{array}{l} \text{All use deep networks} \\ \text{What is the reason?} \end{array} \]
- Ability to approximate different functions?
Both obey the universal approximation theorem.
Argument: one layer is enough, and a deep network could arrange for its other layers to compute the identity function.
- Number of linear regions per parameter?
- Deep networks create many more regions per parameter
- But, there are dependencies between them
- Perhaps symmetries in the real world help? Unknown.
- Depth efficiency?
- Some functions are known to require exponentially more hidden units in a shallow network than in a deep network to achieve an equivalent approximation.
- This is known as the depth efficiency of deep networks.
- But do the real-world functions we want to approximate have this property? Unknown.
- Large structured networks
- Think about images as input: they might have ~1M pixels.
- Fully connected networks are not practical.
- The solution is to have weights that operate only locally and are shared across the image.
- This idea leads to convolutional networks.
- Information from across the image is integrated gradually, which needs multiple layers.
- Fitting and generalization
- Fitting deep models seems to be easier up to about 20 layers.
- Beyond that, various tricks are needed to train deeper networks, so fitting becomes harder.
- Generalization is good in deep networks. Why?
Training dataset of \(I\) pairs of input/output examples: \(\left\{\bm{x}_i, \bm{y}_i\right\}_{i=1}^I\)
Loss function measures how badly the model performs: \(\mc{L}(\bm{\phi}) \triangleq \mc{L}\left(\bm{\phi}, f(\bm{x}; \bm{\phi}), \left\{\bm{x}_i, \bm{y}_i\right\}_{i=1}^I\right)\)
Find the parameters that minimize the loss: \(\hat{\bm{\phi}} = \argmin_{\bm{\phi}}\; \mc{L}(\bm{\phi})\)
Maximum likelihood
- Rather than directly predicting an output \(\bm{y}\) given input \(\bm{x}\), the model predicts a conditional probability distribution \(\P(\bm{y} \mid \bm{x})\) over outputs \(\bm{y}\) given inputs \(\bm{x}\).
- Loss function aims to make the outputs have high probability!
Computing a distribution over outputs
- Pick a known distribution to model output \(y\) with parameters \(\theta\)
- Example: Normal distribution \(\mc{N}(y; \bm{{\theta}}) = \mc{N}(y; \mu, \sigma^2)\)
- Use the model to predict the parameters \(\bm{\theta}\) of this probability distribution.
Maximum likelihood criterion
We choose the model parameters \(\bm{\phi}\) so that they maximize the combined probability across all \(I\) training examples:
\[ \begin{aligned} \begin{split} \hat{\bm{\phi}} &= \argmax_\phi \left[ \prod_{i=1}^I \P(\bm{y}_i \mid \bm{x}_i; \bm{\theta}_i)\right] \\ &= \argmax_\phi \left[ \prod_{i=1}^I \P(\bm{y}_i \mid f(\bm{x}_i ; \bm{\phi}))\right] \end{split} \end{aligned} \]
This relies on two assumptions:
- The data are identically distributed.
- The conditional distributions \(\P(\bm{y}_i \mid \bm{x}_i)\) of the output given the input are independent, so the total likelihood of the training data decomposes as \[ \P(\bm{y}_1, \bm{y}_2, \ldots, \bm{y}_I \mid \bm{x}_1, \bm{x}_2, \ldots, \bm{x}_I) = \prod_{i=1}^I \P(\bm{y}_i \mid \bm{x}_i). \]
Maximizing log-likelihood
- Each term \(\P(\bm{y}_i \mid f(\bm{x}_i; \bm{\phi}))\) can be small, so the product of many of these is numerically unstable (too small).
- Instead, we (equivalently) maximize the logarithm of the likelihood:
\[ \begin{aligned} \begin{split} \hat{\bm{\phi}} &= \argmax_\phi \left[ \prod_{i=1}^I \P(\bm{y}_i \mid f(\bm{x}_i ; \bm{\phi}))\right] \\ &= \argmax_\phi \left[ \log \left( \prod_{i=1}^I \P(\bm{y}_i \mid f(\bm{x}_i ; \bm{\phi})) \right) \right] \\ &= \argmax_\phi \left[ \sum_{i=1}^I \log \left( \P(\bm{y}_i \mid f(\bm{x}_i ; \bm{\phi})) \right) \right]. \end{split} \end{aligned} \]
- The log-likelihood has the practical advantage of using a sum of terms, not a product, so representing it with finite precision is not problematic.
By convention, we minimize things (i.e., a loss) \[ \begin{aligned} \begin{split} \hat{\bm{\phi}} &= \argmax_\phi \left[ \sum_{i=1}^I \log \left( \P(\bm{y}_i \mid f(\bm{x}_i ; \bm{\phi})) \right) \right] \\ &= \argmin_\phi \left[ -\sum_{i=1}^I \log \left( \P(\bm{y}_i \mid f(\bm{x}_i ; \bm{\phi})) \right) \right] \\ &= \argmin_\phi \; \mc{L}(\bm{\phi}). \end{split} \end{aligned} \]
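A small numeric illustration of why the log transformation matters in practice: the product of many per-example likelihoods underflows in floating point, while the sum of their logs stays representable. The probabilities below are made up.

```python
# Product of 1000 small likelihoods underflows; the log-likelihood does not.
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(1e-4, 1e-2, size=1000)  # per-example likelihoods

print(np.prod(probs))           # 0.0 -- underflows in float64
print(-np.sum(np.log(probs)))   # finite negative log-likelihood
```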
Inference
- The network no longer directly predicts the outputs \(\bm{y}\) but instead a probability distribution over \(\bm{y}\).
- To obtain a point estimate, we (typically, not necessarily) return the maximum of the distribution \[ \hat{\bm{y}} = \argmax_y \; \P(\bm{y} \mid f(\bm{x}; \hat{\bm{\phi}})). \]
- Often, it is possible to find a closed-form expression for this in terms of the distribution parameters \(\bm{\theta}\) predicted by the model.
Recipe for constructing loss functions
- Choose a suitable probability distribution \(\P(\bm{y} \mid \bm{\theta})\) defined over the domain of the predictions \(\bm{y}\) with distribution parameters \(\bm{\theta}\).
- Set the ML model \(f(\bm{x}; \bm{\phi})\) to predict one or more of these parameters so \(\bm{\theta} = f(\bm{x}; \bm{\phi})\).
- To train the model, find the network parameters \(\hat{\bm{\phi}}\) that minimize the negative log-likelihood loss function over the training dataset pairs \(\{\bm{x}_i, \bm{y}_i\}\) \[ \hat{\bm{\phi}} = \argmin_\phi \; \mc{L}(\bm{\phi}) = \argmin_\phi \left[ -\sum_{i=1}^I \log \left( \P(\bm{y}_i \mid f(\bm{x}_i ; \bm{\phi})) \right) \right]. \]
- To perform inference for a new test example \(\bm{x}\), return either the full distribution \(\P(\bm{y} \mid f(\bm{x}; \bm{\phi}))\) or the value where this distribution is maximized.
Example 1: Univariate regression
- Goal: predict a scalar output \(y \in \R\) from an input \(\bm{x}\) using a model \(f(\bm{x}; \bm{\phi})\).
- Select the Normal distribution \(\mc{N}(y; \mu, \sigma^2)\)
Find the network parameters \(\hat{\bm{\phi}}\) that minimize the negative log-likelihood loss function over the training dataset pairs \(\{\bm{x}_i, y_i\}\) \[ \begin{aligned} \begin{split} \hat{\bm{\phi}} &= \argmin_\phi \; \mc{L}(\bm{\phi}) \\ &= \argmin_\phi \left[ -\sum_{i=1}^I \log \left( \P(y_i \mid f(\bm{x}_i ; \bm{\phi}), \sigma^2) \right) \right] \\ &= \argmin_\phi \left[ \sum_{i=1}^I \left( \frac{(y_i - f(\bm{x}_i; \bm{\phi}))^2}{2\sigma^2} + \frac{1}{2} \log(2\pi\sigma^2) \right) \right] \\ &= \argmin_\phi \left[ \sum_{i=1}^I (y_i - f(\bm{x}_i; \bm{\phi}))^2 \right] \quad {\color{DodgerBlue} \class{thick-arrow}{\xleftarrow{\hspace{1cm}}}} \quad \text{Least squares!} \end{split} \end{aligned} \]
To perform inference for a new test example \(\bm{x}\), return the mean of the predicted distribution, which corresponds to the maximum of this distribution.
Note that we could have learned the variance \(\sigma^2\) as well, by having the model predict it too: \[ \mu = f_1(\bm{x}; \bm{\phi}), \quad \sigma^2 = f_2(\bm{x}; \bm{\phi}), \] which results in the loss function \[ \mc{L}(\bm{\phi}) = \sum_{i=1}^I \left( \frac{(y_i - f_1(\bm{x}_i; \bm{\phi}))^2}{2f_2(\bm{x}_i; \bm{\phi})} + \frac{1}{2} \log(2\pi f_2(\bm{x}_i; \bm{\phi})) \right). \] When the uncertainty of the model varies as a function of the input data like this, we call the model heteroscedastic.
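A sketch of this heteroscedastic negative log-likelihood as a standalone function; here `mu` and `sigma2` stand in for the two network heads \(f_1\) and \(f_2\), and all values are illustrative.

```python
# Heteroscedastic Gaussian negative log-likelihood.
import numpy as np

def gaussian_nll(y, mu, sigma2):
    # Squared error scaled by the predicted variance, plus the
    # log-normalizer that penalizes overly large variances.
    return np.sum((y - mu) ** 2 / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2))

y = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.8, 3.2])      # predicted means f_1(x_i; phi)
sigma2 = np.array([0.2, 0.5, 0.1])  # predicted variances f_2(x_i; phi)
print(gaussian_nll(y, mu, sigma2))
```

In practice the network would usually predict \(\log \sigma^2\) rather than \(\sigma^2\) itself, so positivity is automatic.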
Example 2: Binary classification
- Choose a suitable prob. distribution \(\P(\bm{y} \mid \bm{\theta})\) defined over the domain of the predictions \(\bm{y}\) with distribution parameters \(\bm{\theta}\).
- Domain: \(y \in \{0, 1\}\)
- Bernoulli distribution:
- One parameter \(\lambda \in [0, 1]\) (probability of class \(y=1\)) \[ \P(y \mid \lambda) = \lambda^y (1 - \lambda)^{1-y} \]
- Set the ML model \(f(\bm{x}; \bm{\phi})\) to predict one or more of these parameters so \(\bm{\theta} = f(\bm{x}; \bm{\phi})\).
- Output of the neural net is unconstrained, whereas the parameter \(\lambda \in [0, 1]\).
- Pass it through a function that maps \(\R \to [0, 1]\): \[ \lambda = \operatorname{sig}(z) = \frac{1}{1 + \exp(-z)}, \quad z = f(\bm{x}; \bm{\phi}). \]
\[ \P(y \mid \lambda) = \lambda^y (1 - \lambda)^{1-y} = \operatorname{sig}(f(\bm{x}; \bm{\phi}))^y (1 - \operatorname{sig}(f(\bm{x}; \bm{\phi})))^{1-y}. \]
- To train the model, find the network parameters \(\hat{\bm{\phi}}\) that minimize the negative log-likelihood loss function over the training dataset pairs \(\{\bm{x}_i, y_i\}\)
- This is the binary cross-entropy loss. \[ \begin{aligned} \begin{split} \hat{\bm{\phi}} &= \argmin_\phi \; \mc{L}(\bm{\phi}) \\ &= \argmin_\phi \left[ -\sum_{i=1}^I \log \left( \P(y_i \mid f(\bm{x}_i ; \bm{\phi})) \right) \right] \\ &= \argmin_\phi \left[ -\sum_{i=1}^I \left( y_i \log(\operatorname{sig}(f(\bm{x}_i; \bm{\phi}))) + (1 - y_i) \log(1 - \operatorname{sig}(f(\bm{x}_i; \bm{\phi}))) \right) \right] \end{split} \end{aligned} \]
- To perform inference for a new test example \(\bm{x}\), return the class with the highest predicted probability \[ \hat{y} = \begin{cases} 1, & \text{if } \operatorname{sig}(f(\bm{x}; \hat{\bm{\phi}})) \geq 0.5, \\ 0, & \text{otherwise}. \end{cases} \]
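A sketch of the binary cross-entropy loss computed directly from the unconstrained outputs (logits) \(z = f(\bm{x}; \bm{\phi})\); using \(\log \operatorname{sig}(z) = -\log(1 + \exp(-z))\) via `np.logaddexp` keeps it stable for large \(|z|\). The labels and logits below are made up.

```python
# Numerically stable binary cross-entropy from logits.
import numpy as np

def bce_from_logits(y, z):
    # -[ y*log(sig(z)) + (1-y)*log(1-sig(z)) ], computed stably with
    # log(sig(z)) = -logaddexp(0, -z) and log(1-sig(z)) = -logaddexp(0, z).
    return np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))

y = np.array([1, 0, 1])          # labels
z = np.array([2.3, -1.1, 0.4])   # unconstrained network outputs
print(bce_from_logits(y, z))
```

Note that thresholding \(\operatorname{sig}(z)\) at 0.5 for inference is equivalent to checking \(z \geq 0\).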
Example 3: Multiclass classification
- Choose a suitable prob. distribution \(\P(\bm{y} \mid \bm{\theta})\) defined over the domain of the predictions \(\bm{y}\) with distribution parameters \(\bm{\theta}\).
- Domain: \(y \in \{1, 2, \ldots, K\}\)
- Categorical distribution
- \(K\) parameters \(\lambda_k \in [0, 1]\), \(k=1, \ldots, K\) (probability of class \(k\)), with \(\sum_{k=1}^K \lambda_k = 1\) \[ \P(y = k \mid \lambda_1, \ldots, \lambda_K) = \lambda_k \]
- Set the ML model \(f(\bm{x}; \bm{\phi})\) to predict one or more of these parameters so \(\bm{\theta} = f(\bm{x}; \bm{\phi})\).
- Output of the neural net is unconstrained, whereas the parameters \(\lambda_k \in [0, 1]\) with \(\sum_{k=1}^K \lambda_k = 1\).
- Pass them through a function that maps \(\R^K \to [0, 1]^K\) with unit sum: the softmax function \[ \P(y=k \mid \bm{x}) = \lambda_k = \operatorname{softmax}_k(\bm{z}) = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)}, \quad z_k = f_k(\bm{x}; \bm{\phi}). \]
- To train the model, find the network parameters \(\hat{\bm{\phi}}\) that minimize the negative log-likelihood loss function over the training dataset pairs \(\{\bm{x}_i, y_i\}\)
- This is the categorical (multiclass) cross-entropy loss. \[ \begin{aligned} \begin{split} \hat{\bm{\phi}} &= \argmin_\phi \; \mc{L}(\bm{\phi}) \\ &= \argmin_\phi \left[ -\sum_{i=1}^I \log \left( \P(y_i \mid f(\bm{x}_i ; \bm{\phi})) \right) \right] \\ &= \argmin_\phi \left[ -\sum_{i=1}^I \log \left( \operatorname{softmax}_{y_i}(f(\bm{x}_i; \bm{\phi})) \right) \right] \\ &= \argmin_\phi \left[ -\sum_{i=1}^I \log \left( \frac{\exp(f_{y_i}(\bm{x}_i; \bm{\phi}))}{\sum_{j=1}^K \exp(f_j(\bm{x}_i; \bm{\phi}))} \right) \right] \\ &= \argmin_\phi \left[ -\sum_{i=1}^I \left( f_{y_i}(\bm{x}_i; \bm{\phi}) - \log \left( \sum_{j=1}^K \exp(f_j(\bm{x}_i; \bm{\phi})) \right) \right) \right] \end{split} \end{aligned} \]
- To perform inference for a new test example \(\bm{x}\), return the class with the highest predicted probability \[ \hat{y} = \argmax_{k=1, \ldots, K} \; \operatorname{softmax}_k(f(\bm{x}; \hat{\bm{\phi}})). \]
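A sketch of the final form of this loss using the standard max-subtraction (log-sum-exp) trick for numerical stability; classes are 0-indexed here, and the logits are illustrative stand-ins for \(f(\bm{x}_i; \bm{\phi})\).

```python
# Numerically stable multiclass cross-entropy from logits.
import numpy as np

def cross_entropy(y, logits):
    """y: integer class labels (0-indexed); logits: shape (I, K)."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift: argmin unchanged
    log_sum_exp = np.log(np.exp(z).sum(axis=1))
    # Per example: log sum_j exp(z_j) - z_{y_i}, summed over the dataset.
    return np.sum(log_sum_exp - z[np.arange(len(y)), y])

y = np.array([0, 2, 1])
logits = np.array([[2.0, 0.1, -1.0],
                   [0.5, 0.3,  3.0],
                   [1.0, 2.5,  0.2]])
print(cross_entropy(y, logits))
```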
Other data types
| Data Type | Domain | Distribution | Use |
|---|---|---|---|
| univariate, continuous, unbounded | \(y \in \mathbb{R}\) | univariate normal | regression |
| univariate, continuous, unbounded | \(y \in \mathbb{R}\) | Laplace or t-distribution | robust regression |
| univariate, continuous, unbounded | \(y \in \mathbb{R}\) | mixture of Gaussians | multimodal regression |
| univariate, continuous, bounded below | \(y \in \mathbb{R}^+\) | exponential or gamma | predicting magnitude |
| univariate, continuous, bounded | \(y \in [0, 1]\) | beta | predicting proportions |
| multivariate, continuous, unbounded | \(\mathbf{y} \in \mathbb{R}^K\) | multivariate normal | multivariate regression |
| univariate, continuous, circular | \(y \in (-\pi, \pi]\) | von Mises | predicting direction |
| univariate, discrete, binary | \(y \in \{0, 1\}\) | Bernoulli | binary classification |
| univariate, discrete, bounded | \(y \in \{1, 2, \dots, K\}\) | categorical | multiclass classification |
| univariate, discrete, bounded below | \(y \in \{0, 1, 2, 3, \dots\}\) | Poisson | predicting event counts |
| multivariate, discrete, permutation | \(\mathbf{y} \in \text{Perm}[1, 2, \dots, K]\) | Plackett-Luce | ranking |
Cross-entropy loss
The dissimilarity between two probability distributions \(q(z)\) and \(p(z)\) can be measured using the Kullback-Leibler (KL) divergence (not a true distance, since it is not symmetric) \[ \begin{split} \mathrm{D}_{\mathrm{KL}}(q \parallel p) &= \int q(z) \log\left( \frac{q(z)}{p(z)} \right) dz \\ &= \int q(z) \log{q(z)} dz - \int q(z) \log{p(z)} dz = -\mathbb{E}_{z \sim q}[\log p(z)] + \mathbb{E}_{z \sim q}[\log q(z)]. \end{split} \]
When we observe empirical data distribution at points \(\{y_i\}_{i=1}^I\), we can represent this using the empirical distribution \[ q(y) = \frac{1}{I} \sum_{i=1}^I \delta(y - y_i), \] where \(\delta(\cdot)\) is the Dirac delta function.
The KL divergence between the empirical distribution and a model distribution \(p(y \mid \bm{\theta})\) is \[ \mathrm{D}_{\mathrm{KL}}(q \parallel p) = -\mathbb{E}_{y \sim q}[\log p(y \mid \bm{\theta})] + \mathbb{E}_{y \sim q}[\log q(y)] = -\frac{1}{I} \sum_{i=1}^I \log p(y_i \mid \bm{\theta}) + \text{const.} \]
Hence, minimizing the KL divergence is equivalent to maximizing the log-likelihood! Indeed, \[ \hat{\bm{\theta}} = \argmin_{\bm{\theta}} \; \mathrm{D}_{\mathrm{KL}}(q \parallel p) = \argmin_{\bm{\theta}} \left[ -\frac{1}{I} \sum_{i=1}^I \log p(y_i \mid \bm{\theta}) \right] = \argmax_{\bm{\theta}} \left[ \sum_{i=1}^I \log p(y_i \mid \bm{\theta}) \right] \]
- When we train a neural network, the distribution parameters \(\bm{\theta}\) are computed by the model \(f(\bm{x}; \bm{\phi})\), so we have \[ \hat{\bm{\phi}} = \argmin_{\bm{\phi}} \; \mathrm{D}_{\mathrm{KL}}(q \parallel p) = \argmin_{\bm{\phi}} \left[ -\frac{1}{I} \sum_{i=1}^I \log p(y_i \mid f(\bm{x}_i; \bm{\phi})) \right]. \]
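As a small sanity check of this equivalence, assume a fixed-variance normal model \(p(y \mid \theta) = \mc{N}(y; \theta, 1)\): the \(\theta\) that minimizes the negative log-likelihood should be the sample mean. The data values below are made up.

```python
# Minimizing the NLL of N(y; theta, 1) recovers the sample mean.
import numpy as np

y = np.array([0.8, 1.5, 2.1, 1.2])

def nll(theta):
    return np.sum(0.5 * (y - theta) ** 2 + 0.5 * np.log(2 * np.pi))

thetas = np.linspace(0.0, 3.0, 3001)            # brute-force grid search
theta_hat = thetas[np.argmin([nll(t) for t in thetas])]
print(theta_hat, y.mean())                      # both ~1.4
```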
