$$
% Define your custom commands here
\newcommand{\bmat}[1]{\begin{bmatrix}#1\end{bmatrix}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\P}{\mathbb{P}}
\newcommand{\S}{\mathbb{S}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[2]{\|{#1}\|_{{}_{#2}}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pdd}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|}
\newcommand{\abs}[1]{\left|{#1}\right|}
\newcommand{\mbf}[1]{\mathbf{#1}}
\newcommand{\mc}[1]{\mathcal{#1}}
\newcommand{\bm}[1]{\boldsymbol{#1}}
\newcommand{\nicefrac}[2]{{}^{#1}\!/_{\!#2}}
\newcommand{\argmin}{\operatorname*{arg\,min}}
\newcommand{\argmax}{\operatorname*{arg\,max}}
$$

Week 2: Introduction to Learning

Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]


Categories of Learning


Supervised Learning

  • Define a mapping from input to output
  • Learn this mapping from paired input/output data examples

Regression

Graph Regression

Text Classification

Music Genre Classification

Image Classification

What is a supervised learning model?

  • An equation relating input (age) to output (height)
  • Search through a family of possible equations to find one that fits training data well
  • Deep neural networks are just a very flexible family of equations
  • Fitting deep neural networks = “Deep Learning”

Image Segmentation

  • Multivariate binary classification problem (many outputs, two discrete classes)
  • Convolutional encoder-decoder network

Depth Estimation

  • Multivariate regression problem (many outputs, continuous)
  • Convolutional encoder-decoder network

Pose Estimation

  • Multivariate regression problem (many outputs, continuous)
  • Convolutional encoder-decoder network
Tip: Terms
  • Regression: continuous numbers as output
  • Classification: discrete classes as output
  • Two-class and multiclass classification are treated somewhat differently.
  • Univariate: one output
  • Multivariate: more than one output

Translation

Image Captioning

Image Generation from Text

All these examples have in common:

  • Very complex relationship between input and output
  • Sometimes there may be many possible valid answers
  • However, outputs (and sometimes inputs) obey certain rules (e.g. grammar)

The idea might be

  • Learn the “grammar” of the data from unlabeled examples
  • Here we can use a gargantuan amount of data (unlabeled)
  • This makes the supervised learning task easier, since the model already has substantial knowledge of the space of possible outputs

Unsupervised Learning

  • Learning about a dataset without labels: e.g. clustering
  • Generative models can create examples: e.g. diffusion models
  • Probabilistic generative models (PGMs) learn a distribution over the data: e.g. VAEs, normalizing flows

Latent Variables

Reinforcement Learning

  • States are valid positions of the chess board

  • Actions at a given position are valid possible moves

  • Positive rewards for taking pieces, negative rewards for losing them

  • Goal: take actions to change the state so that you receive rewards

  • You don’t receive any data - you have to explore the environment yourself to gather data as you go.

Why is this difficult?

  • Stochastic
    • Make the same move twice, the opponent might not do the same thing
    • Rewards also stochastic (opponent does or does not take your piece)
  • Temporal credit assignment problem
    • Did we get the reward because of this move, or because we made good tactical decisions somewhere in the past?
  • Exploration-exploitation trade-off
    • If we found a good opening, should we stick to it?
    • Should we try other things, hoping for something better?
  • A supervised learning model defines a mapping from one or more inputs to one or more outputs
  • The model is a family of mathematical equations
  • Computing the outputs given the inputs is termed inference
Note: Example
  • Input: age and mileage of a second-hand Toyota Prius
  • Output: estimated price of car
  • Model also includes parameters
  • Parameters affect the outcome of the model (equation)
  • Training a model: finding parameters that predict the outputs “well”
    • Given inputs for a training dataset of input/output pairs

Notation

  • Input: \(\bm{x} \in \R^n\)
  • Output: \(\bm{y} \in \R^m\)
  • Model: \(\bm{y} = \bm{f}(\bm{x}; \bm{\phi})\)
  • Parameters: \(\bm{\phi} \in \R^p\)
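
To make this notation concrete, here is a minimal Python sketch (not from the slides) with an affine map standing in for \(\bm{f}\); the parameter layout, shapes, and values are assumptions chosen only for illustration.

```python
# Hypothetical example of the notation y = f(x; phi), here an affine map.
import numpy as np

def f(x, phi):
    """Model y = f(x; phi): phi = (W, b), with W of shape (m, n), b of shape (m,)."""
    W, b = phi
    return W @ x + b    # output y in R^m

# Example: n = 2 inputs, m = 1 output
phi = (np.array([[0.5, -1.0]]), np.array([0.2]))
x = np.array([3.0, 1.0])
y = f(x, phi)           # array([0.7])
```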

Loss function

  • Training dataset: \(I\) pairs of input/output examples: \(\left\{ \bm{x}_i, \bm{y}_i \right\}_{i=1}^I\)
  • Loss function or cost function measures how bad the model is: \[ \mc{L}\left(\bm{\phi}, \underbrace{\bm{f}(\bm{x}; \bm{\phi})}_{\text{model}}, \left\{ \bm{x}_i, \bm{y}_i \right\}_{i=1}^I\right) \]

or for short \(\mc{L}(\bm{\phi})\)

Training

  • Find the parameters that minimize the loss: \[ \hat{\bm{\phi}} = \argmin_{\bm{\phi}}\; \mc{L}(\bm{\phi}) \]

Testing

  • To test the model, run on a separate test dataset of input/output pairs
  • See how well it generalizes to new data

Regression

  • Univariate regression problem (one output, real value)
  • Fully connected network

Example

  • Loss function \[ \mathcal{L}(\bm{\phi}) = \sum_{i=1}^I \left( f(x_i; \bm{\phi}) - y_i \right)^2 = \sum_{i=1}^I \left(\phi_0 + \phi_1x_i - y_i \right)^2 \]

  • Training performed by gradient descent
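
As a rough illustration of this training loop, here is a short gradient-descent sketch for the 1D linear model above; the synthetic data, learning rate, and iteration count are assumptions, not part of the slides.

```python
# Minimal sketch: least-squares loss for y = phi_0 + phi_1 * x, minimized by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
y = 0.5 + 2.0 * x + 0.05 * rng.standard_normal(20)   # noisy line (made-up data)

phi = np.zeros(2)          # [phi_0, phi_1]
lr = 0.01                  # learning rate (step size)

for step in range(1000):
    r = phi[0] + phi[1] * x - y          # residuals f(x_i; phi) - y_i
    grad = np.array([2 * np.sum(r),      # dL/dphi_0
                     2 * np.sum(r * x)]) # dL/dphi_1
    phi -= lr * grad                     # gradient-descent update

print(phi)   # approaches roughly [0.5, 2.0]
```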
Warning: Observations
  • 1D regression model is obviously limited
    • Want to be able to describe input/output that are not lines
    • Want multiple inputs
    • Want multiple outputs
  • Shallow neural nets
    • Flexible enough to describe arbitrarily complex input/output mappings
    • Can have as many inputs and outputs as we want

Example shallow neural net

\[ y = f(x; \bm{\phi}) = \phi_0 + \phi_1 \sigma(\theta_{10} + \theta_{11}x) + \phi_2\sigma(\theta_{20} + \theta_{21}x) + \phi_3\sigma(\theta_{30} + \theta_{31}x), \tag{1}\]

where \(\sigma: \R \to \R\) is the rectified linear unit, which is a particular kind of activation function

\[ \sigma(z) = \operatorname{ReLU}(z) = \begin{cases} 0, & z < 0 \\ z, & z \geq 0 \end{cases} \]

This model has \(10\) parameters: \(\bm{\phi} = \{\phi_0, \phi_1, \phi_2, \phi_3, \theta_{10}, \theta_{11}, \theta_{20}, \theta_{21}, \theta_{30}, \theta_{31}\}\).
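
The equation above translates directly into code. Below is a small sketch of Equation 1 in Python; the particular parameter values are arbitrary placeholders, included only so the example runs.

```python
# Shallow net with one input, three ReLU hidden units, and one output (Equation 1).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def shallow_net(x, phi, theta):
    # phi = [phi_0, phi_1, phi_2, phi_3]; theta[d] = [theta_d0, theta_d1]
    h = np.array([relu(theta[d, 0] + theta[d, 1] * x) for d in range(3)])
    return phi[0] + phi[1] * h[0] + phi[2] * h[1] + phi[3] * h[2]

phi = np.array([-0.2, 0.5, -1.0, 0.8])          # placeholder values
theta = np.array([[0.3, -1.0],
                  [-1.0, 2.0],
                  [-0.5, 0.7]])
print(shallow_net(0.6, phi, theta))             # one point on the piecewise linear curve
```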

  • Represents a family of functions
  • Parameters determine particular function
  • Given parameters, can perform inference (run equation)
  • Given training set \(\{\bm{x}_i, \bm{y}_i\}_{i=1}^I\), define loss function \(\mc{L}\)
  • Change parameters to minimize loss function

Piecewise linear functions with three joints


Hidden units

Let us break down Equation 1 into two parts:

\[ y = \phi_0 + \phi_1h_1 + \phi_2h_2 + \phi_3h_3, \]

where

\[ \begin{split} h_1 &= \sigma(\theta_{10} + \theta_{11}x) \\ h_2 &= \sigma(\theta_{20} + \theta_{21}x) \\ h_3 &= \sigma(\theta_{30} + \theta_{31}x) \end{split} \]

Computation of the shallow neural net

  • Shaded region: units 1 and 3 are active while unit 2 is inactive.

Depicting neural nets

a) Computation flows from left to right. Each parameter multiplies its source and adds the result to its target. b) Typically, the intercepts, ReLU functions, and parameter names are omitted.


Universal approximation theorem

Generalizing from three hidden units to \(D\) hidden units, the \(d^{\text{th}}\) hidden unit is described by

\[ h_d = \sigma(\theta_{d0} + \theta_{d1}x), \]

and these are combined linearly (more precisely affinely) to create the output:

\[ y = \phi_0 + \sum_{d=1}^D \phi_dh_d. \]

Tip: Universal approximation theorem

With enough network capacity (number of hidden units), a shallow network can describe any continuous function on a compact subset of the input space \(\R^{D_i}\) to arbitrary precision.
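
An informal numerical illustration (not a proof): if we fix the \(\theta\) parameters of a wide shallow ReLU network at random and fit only the output weights \(\phi\) by least squares, the resulting piecewise linear function already tracks a smooth target closely. The target function, number of hidden units, and random scheme below are assumptions chosen for the demo.

```python
# Random ReLU hidden units + least-squares fit of the output weights approximates sin(x).
import numpy as np

rng = np.random.default_rng(1)
D = 200                                     # number of hidden units
x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)

theta0 = rng.uniform(-np.pi, np.pi, D)      # random intercepts
theta1 = rng.choice([-1.0, 1.0], D)         # random slopes
H = np.maximum(0.0, theta0[None, :] + theta1[None, :] * x[:, None])  # hidden activations
A = np.concatenate([np.ones((x.size, 1)), H], axis=1)                # columns: 1, h_1, ..., h_D

phi, *_ = np.linalg.lstsq(A, target, rcond=None)   # fit phi_0, ..., phi_D
approx = A @ phi
print(np.max(np.abs(approx - target)))             # maximum error is small (well under 0.1)
```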

Two inputs

  • 2 inputs, 3 hidden units, 1 output

\[ \begin{split} h_1 &= \sigma(\theta_{10} + \theta_{11}x_1 + \theta_{12}x_2) \\ h_2 &= \sigma(\theta_{20} + \theta_{21}x_1 + \theta_{22}x_2) \\ h_3 &= \sigma(\theta_{30} + \theta_{31}x_1 + \theta_{32}x_2) \end{split} \]

\[ y = \phi_0 + \phi_1h_1 + \phi_2h_2 + \phi_3h_3 \]

a-c) The input to each hidden unit is a linear function of the two inputs, which corresponds to an oriented plane. Brightness indicates function output. d-f) Each plane is clipped by the ReLU activation function (cyan lines are equivalent to “joints” in the 1D version). g-i) The clipped planes are then weighted, and j) summed together with an offset that determines the overall height of the surface. The result is a continuous surface made up of convex piecewise linear polygonal regions.

Caution: Question
  • If there were two outputs, what would the contour lines of the second output look like compared to j) above?

General case

  • \(D_o\) outputs, \(D\) hidden units, \(D_i\) inputs

\[ h_d = \sigma\left(\theta_{d0} + \sum_{i=1}^{D_i}\theta_{di}x_i\right) \]

\[ y_j = \phi_{j0} + \sum_{d=1}^D\phi_{jd}h_d \]
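
These two equations vectorize naturally. Here is a short sketch of the general case in Python; the dimensions match the figure below, and the random weights are placeholders.

```python
# Vectorized shallow network: D_i inputs, D hidden units, D_o outputs.
import numpy as np

def shallow_net(x, theta0, Theta, phi0, Phi):
    """x: (D_i,), theta0: (D,), Theta: (D, D_i), phi0: (D_o,), Phi: (D_o, D)."""
    h = np.maximum(0.0, theta0 + Theta @ x)   # hidden units h_d
    return phi0 + Phi @ h                     # outputs y_j

D_i, D, D_o = 3, 3, 2                         # three inputs, three hidden units, two outputs
rng = np.random.default_rng(0)
y = shallow_net(rng.standard_normal(D_i),
                rng.standard_normal(D), rng.standard_normal((D, D_i)),
                rng.standard_normal(D_o), rng.standard_normal((D_o, D)))
print(y.shape)   # (2,)
```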

Three inputs, three hidden units, two outputs


Number of output regions

  • In general, each output is a continuous piecewise linear function whose linear regions are convex polytopes in the \(D_i\)-dimensional input space
  • With two inputs and three hidden units, we saw there were seven such polygonal regions.


a) 1D input with 1 hidden unit creates two regions (one joint). b) 2D input with 2 hidden units creates four regions (two lines). c) 3D input with 3 hidden units creates eight regions (three planes). It follows that with \(D_i\) input dimensions and \(D_i\) hidden units, the input space is divided by \(D_i\) hyperplanes into \(2^{D_i}\) polytopes.
Tip: Zaslavsky (1975)

Number of regions created by \(D > D_i\) hyperplanes in \(D_i\) dimensions is \[ \sum_{j=0}^{D_i} \binom{D}{j}. \] Note that \[ 2^{D_i} < \sum_{j=0}^{D_i} \binom{D}{j} < 2^D. \]
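
This count is easy to check numerically. The sketch below evaluates the formula with Python's math.comb and verifies the stated inequality for a few (assumed) values of \(D\) with \(D_i = 2\).

```python
# Maximum number of regions created by D hyperplanes in D_i dimensions (Zaslavsky's count).
from math import comb

def max_regions(D, D_i):
    return sum(comb(D, j) for j in range(D_i + 1))

D_i = 2
print(max_regions(3, D_i))               # 7, matching the two-input, three-hidden-unit example
for D in [5, 10, 50]:
    r = max_regions(D, D_i)
    print(2**D_i < r < 2**D, r)          # inequality 2^{D_i} < sum < 2^D holds
```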


a) Maximum possible regions as a function of the number of hidden units for five different input dimensions. b) The same data plotted as a function of the number of parameters. Solid circles represent the same model in both panels with \(500\) hidden units.

Terminology

  • y-offsets: biases
  • slopes: weights
  • every node in one layer connected to every node in the next: fully connected network
  • no loops: feedforward network
  • values after the activation functions: activations
  • values before the activation functions: preactivations
  • one hidden layer: shallow neural network
  • more than one hidden layer: deep neural network
  • number of hidden units \(\approx\) capacity

Other activation functions