Residual Networks
- Each network layer computes an additive change to the current representation, instead of transforming it directly.
- Allows deeper networks to be trained.
- However, residual connections cause an exponential increase in the activation magnitudes at initialization.
- Various normalization methods are used to control the magnitude of the activations.
- Residual blocks with batch normalization allow much deeper networks to be trained.
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
- A three-layer network is defined by \[ \begin{aligned} \bm{h}_1 &= f_1(\bm{x}; \bm{\phi}_1) \\ \bm{h}_2 &= f_2(\bm{h}_1; \bm{\phi}_2) \\ \bm{h}_3 &= f_3(\bm{h}_2; \bm{\phi}_3) \\ \bm{y} &= f_4(\bm{h}_3; \bm{\phi}_4) \end{aligned} \]
- Since the processing is sequential, we can equivalently think of this network as a series of nested functions \[ \bm{y} = f_4\left(f_3\left(f_2\left(f_1(\bm{x}; \bm{\phi}_1); \bm{\phi}_2\right); \bm{\phi}_3\right); \bm{\phi}_4\right) \]
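This equivalence is easy to check numerically. A minimal NumPy sketch, with hypothetical random ReLU layers standing in for the \(f_k\), shows that evaluating layer by layer gives the same output as the nested composition:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
# hypothetical layers: each f_k is a random linear map followed by a ReLU
Ws = [rng.normal(0, 1 / np.sqrt(D), (D, D)) for _ in range(4)]

def f(k, h):
    return np.maximum(Ws[k] @ h, 0)

x = rng.normal(size=D)

# evaluate layer by layer ...
h1 = f(0, x)
h2 = f(1, h1)
h3 = f(2, h2)
y = f(3, h3)

# ... which equals the nested composition f4(f3(f2(f1(x))))
assert np.allclose(y, f(3, f(2, f(1, f(0, x)))))
```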
- In principle, we can add as many layers as we want
- But in practice, the performance saturates and then degrades
- Indeed, the degradation is present for both the training and test set.
- This implies that the problem lies in training deeper networks,
- rather than in the inability of deeper networks to generalize.
- For a shallow network, the gradient of the output with respect to the input changes slowly as we change the input.
- However, for a deep network, a tiny change in the input results in a completely different gradient.
- This is captured by the autocorrelation function of the gradient.
- Nearby gradients are correlated for shallow networks, but this correlation quickly drops to zero for deep networks.
- This is termed the shattered gradients phenomenon.
- Shattered gradients presumably arise because changes in early network layers modify the output in an increasingly complex way as the network becomes deeper.
- The derivative of the output with respect to the first layer \(f_1\) of the network is \[ \pd{\bm{y}}{\bm{f}_1} = \pd{\bm{f}_4}{\bm{f}_3} \pd{\bm{f}_3}{\bm{f}_2} \pd{\bm{f}_2}{\bm{f}_1} \]
- When we change the parameters that determine \(\bm{f}_1\):
- all of the derivatives in this sequence are evaluated at slightly different locations
- since layers \(\bm{f}_2, \bm{f}_3, \bm{f}_4\) depend on \(\bm{f}_1\)
- Consequently, the updated gradient at each training example may be completely different, and the loss function becomes badly behaved.
Residual connections and residual blocks
- Residual or skip connections are branches in the computational path, whereby the input to each network layer \(\bm{f}(\cdot)\) is added back to the output \[ \begin{aligned} \bm{h}_1 &= \bm{x} + \bm{f}_1(\bm{x}; \bm{\phi}_1) \\ \bm{h}_2 &= \bm{h}_1 + \bm{f}_2(\bm{h}_1; \bm{\phi}_2) \\ \bm{h}_3 &= \bm{h}_2 + \bm{f}_3(\bm{h}_2; \bm{\phi}_3) \\ \bm{y} &= \bm{h}_3 + \bm{f}_4(\bm{h}_3; \bm{\phi}_4), \end{aligned} \] where the first term on the right-hand side of each line is the residual connection.
- Each function \(\bm{f}_k\) learns an additive change to the current representation.
- It follows that their outputs must be the same size as their inputs.
- Each additive combination of the input and the processed output is known as a residual block or residual layer.
- If we write this as a single function by substituting in the expressions for the intermediate quantities \(\bm{h}_k\): \[ \begin{aligned} \bm{y} = \bm{x} &+ \bm{f}_1(\bm{x}) \\ &+ \bm{f}_2\left(\bm{x} + \bm{f}_1(\bm{x})\right) \\ &+ \bm{f}_3\left(\bm{x} + \bm{f}_1(\bm{x}) + \bm{f}_2\left(\bm{x} + \bm{f}_1(\bm{x})\right)\right), \\ &+ \bm{f}_4\left(\bm{x} + \bm{f}_1(\bm{x}) + \bm{f}_2\left(\bm{x} + \bm{f}_1(\bm{x})\right) + \bm{f}_3\left(\bm{x} + \bm{f}_1(\bm{x}) + \bm{f}_2\left(\bm{x} + \bm{f}_1(\bm{x})\right)\right)\right), \end{aligned} \tag{1}\] where we have omitted the parameters \(\bm{\phi}_k\) for brevity.
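The decomposition above can be verified numerically. In the NumPy sketch below (with hypothetical ReLU branches standing in for the \(\bm{f}_k\)), the output equals the input plus the sum of the four branch outputs, each evaluated at the representation it received:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8
Ws = [rng.normal(0, 1 / np.sqrt(D), (D, D)) for _ in range(4)]

def f(k, h):
    # hypothetical residual branch: a small nonlinear transformation
    return np.maximum(Ws[k] @ h, 0)

x = rng.normal(size=D)
h, branches = x, []
for k in range(4):
    b = f(k, h)        # branch output, evaluated at the current representation
    branches.append(b)
    h = h + b          # residual connection: add the branch back to its input
y = h

# the output is the input plus the sum of the four branch outputs
assert np.allclose(y, x + sum(branches))
```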
- The final network output is a sum of the input and four smaller networks corresponding to each line of the equation
- One interpretation is that residual connections turn the original network into an ensemble of these smaller networks.
- These are summed to compute the result.
- A complementary way of thinking is that residual connections create sixteen paths with differing numbers of transformations between input and output.
- The first function \(\bm{f}_1(\bm{x})\) occurs in eight of these sixteen paths, including as a direct additive term. \[ \pd{\bm{y}}{\bm{f}_1} = \bm{I} + \pd{\bm{f}_2}{\bm{f}_1} + \left( \pd{\bm{f}_3}{\bm{f}_1} + \pd{\bm{f}_2}{\bm{f}_1} \pd{\bm{f}_3}{\bm{f}_2} \right) + \left( \pd{\bm{f}_4}{\bm{f}_1} + \pd{\bm{f}_2}{\bm{f}_1}\pd{\bm{f}_4}{\bm{f}_2} + \pd{\bm{f}_3}{\bm{f}_1}\pd{\bm{f}_4}{\bm{f}_3} + \pd{\bm{f}_2}{\bm{f}_1} \pd{\bm{f}_3}{\bm{f}_2} \pd{\bm{f}_4}{\bm{f}_3} \right) \] where there is one term for each of the eight paths.
- The identity term on the right-hand side shows that changes in the parameters \(\bm{\phi}_1\) in the first layer \(\bm{f}_1(\bm{x}; \bm{\phi}_1)\) contribute directly to changes in the network output \(\bm{y}\).
- They also contribute indirectly through the other chains of derivatives of varying lengths.
- In general, gradients through shorter paths will be better behaved.
- Since both the identity term and various short chains of derivatives will contribute to the derivative for each layer, networks with residual links suffer less from shattered gradients.
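The contrast can be illustrated with a small NumPy experiment (a sketch with hypothetical random ReLU layers; the width, depth, and perturbation size are assumed values). Manual backpropagation computes the gradient of the summed output with respect to the input, once for a plain network and once for the residual version, at two nearby inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 32, 24   # width and depth (assumed for illustration)

# He-style initialization for the ReLU layers
Ws = [rng.normal(0, np.sqrt(2 / D), (D, D)) for _ in range(K)]

def grad_plain(x):
    """Gradient of sum(output) w.r.t. x for a plain network h <- relu(W h)."""
    h, masks = x, []
    for W in Ws:
        a = W @ h
        masks.append(a > 0)
        h = np.maximum(a, 0)
    g = np.ones(D)
    for W, m in zip(reversed(Ws), reversed(masks)):
        g = W.T @ (g * m)       # backprop through ReLU, then the linear map
    return g

def grad_residual(x):
    """Same, but for residual blocks h <- h + relu(W h)."""
    h, masks = x, []
    for W in Ws:
        a = W @ h
        masks.append(a > 0)
        h = h + np.maximum(a, 0)
    g = np.ones(D)
    for W, m in zip(reversed(Ws), reversed(masks)):
        g = g + W.T @ (g * m)   # identity path plus the branch path
    return g

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

x = rng.normal(size=D)
dx = 0.1 * rng.normal(size=D)
c_plain = cosine(grad_plain(x), grad_plain(x + dx))
c_res = cosine(grad_residual(x), grad_residual(x + dx))
# compare the two: residual gradients typically remain far more correlated
print(c_plain, c_res)
```

Typically, the plain network's gradients decorrelate quickly for nearby inputs, while the residual network's gradients remain well correlated.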
- If the ReLU function is the final operation in a residual block, the branch output is nonnegative.
- Therefore, each residual block can only increase the input values.
- It is typical to change the order of operations so that the activation function is applied first,
- followed by a linear transformation.
- There may also be several layers of processing within the residual block;
- these often terminate with a linear transformation.
Tip: How deep can we make networks?
- Adding residual connections roughly doubles the depth of a network that can be practically trained before performance degrades.
- We still want to increase the depth further
- To understand why residual connections do not allow us to increase the depth arbitrarily, consider:
- how the variance of the activations changes during the forward pass
- how the gradient magnitudes change during the backward pass
Important: Exploding gradients in residual networks
- Recall that bad weight initialization may lead to
- Exponentially increasing activation magnitudes during the forward pass
- Exponentially increasing gradient magnitudes during the backward pass
- We initialize the network parameters so that
- the activations (in the forward pass)
- and the gradients (in the backward pass)
- remain the same between layers.
- In residual networks, the activation magnitudes in the forward pass increase exponentially as we move through the network.
- Each residual block adds its processed branch back to the input.
- If the branch is initialized to have roughly the same variance as the input, recombining them doubles the variance.
- Hence, the variance grows exponentially with the number of residual blocks.
- This limits the possible network depth before floating-point precision is exceeded in the forward pass.
- A similar argument applies to the gradients in the backward pass of the backpropagation algorithm.
- Hence, residual networks still suffer from unstable forward propagation and exploding gradients, even with He initialization.
- One way to stabilize the forward and backward passes would be to use He initialization
- and then multiply the combined output of the residual block by \(\nicefrac{1}{\sqrt{2}}\) to prevent the variance from doubling.
- More usual: use batch normalization.
Batch Normalization
- BatchNorm shifts and rescales each activation \(h\) so that its mean and variance across the batch \(\mc{B}\) becomes values that are learned during training. \[ \begin{aligned} m_h &= \frac{1}{|\mc{B}|} \sum_{i \in \mc{B}} h_i \\ s_h &= \sqrt{\frac{1}{|\mc{B}|} \sum_{i \in \mc{B}} (h_i - m_h)^2}, \end{aligned} \] where all quantities are scalars.
- Then, we use these statistics to standardize the batch activations to have mean zero and unit variance: \[ h_i \leftarrow \frac{h_i - m_h}{s_h + \epsilon}, \qquad \forall i \in \mc{B}, \tag{2}\] where \(\epsilon\) is a small number that prevents division by zero if \(h_i\) is the same for every member of the batch and \(s_h = 0\).
- Finally, the normalized variable is scaled by \(\gamma\) and shifted by \(\delta\): \[ h_i \leftarrow \gamma h_i + \delta, \qquad \forall i \in \mc{B}. \]
- After this operation, the activations have mean \(\delta\) and standard deviation \(\gamma\) across all members of the batch.
- Both of these quantities are learned during training.
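The three steps above can be sketched directly in NumPy (a minimal per-unit version; in a real network the operation is applied at every hidden unit of every normalized layer):

```python
import numpy as np

def batchnorm(h, gamma, delta, eps=1e-5):
    """Normalize each hidden unit (column) across the batch, then scale and shift."""
    m_h = h.mean(axis=0)           # batch mean per unit
    s_h = h.std(axis=0)            # batch standard deviation per unit
    h = (h - m_h) / (s_h + eps)    # standardize to zero mean, unit variance
    return gamma * h + delta       # learned scale gamma and offset delta

rng = np.random.default_rng(0)
h = rng.normal(3.0, 2.0, size=(128, 4))   # batch of 128 examples, 4 hidden units
out = batchnorm(h, gamma=1.5, delta=0.5)
# across the batch, each unit now has mean delta and std (roughly) gamma
```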
Note: Batch normalization
- Batch normalization is applied independently to each hidden unit.
- In a standard neural net with \(K\) layers, each containing \(D\) hidden units, there would be \(KD\) learned offsets \(\delta\) and \(KD\) learned scales \(\gamma\).
- In a convolutional network, the normalizing statistics are computed over both the batch and the spatial positions.
- If there were \(K\) layers, each containing \(C\) channels, there would be \(KC\) learned offsets and \(KC\) learned scales.
- At test time, we do not have a batch from which we can gather statistics.
- To resolve this, the statistics \(m_h\) and \(s_h\) are calculated across the whole training dataset and frozen in the final network.
Costs and benefits of batch normalization
- Batch normalization makes the network invariant to rescaling the weights and biases that contribute to each activation
- if these are doubled, then the activations also double
- the estimated standard deviation \(s_h\) also doubles, and the normalization in Equation 2 compensates for these changes.
- Since this happens for each hidden unit, there will be a large family of weights and biases that all produce the same effect.
- BatchNorm adds two parameters \(\gamma\) and \(\delta\) per hidden unit, which makes the model somewhat larger.
- Hence, it both creates redundancy in the weights and biases
- and adds extra parameters to compensate for that redundancy.
- This is obviously inefficient, but BatchNorm also provides several benefits.
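The rescaling invariance is easy to verify numerically: doubling the pre-activations also doubles the batch statistics, so the standardized output is (up to the small \(\epsilon\)) unchanged. A sketch:

```python
import numpy as np

def standardize(h, eps=1e-5):
    return (h - h.mean(axis=0)) / (h.std(axis=0) + eps)

rng = np.random.default_rng(0)
h = rng.normal(size=(64, 3))   # pre-activations for a batch of 64, 3 units

# doubling the weights/biases feeding these activations doubles h,
# but the batch mean and std double too, so normalization compensates
assert np.allclose(standardize(h), standardize(2 * h), atol=1e-4)
```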
Warning: Stable forward propagation
- If we initialize the offsets \(\delta\) to zero and the scales \(\gamma\) to one,
- then each output activation will have unit variance.
- in a regular neural net, this ensures the variance is stable during forward propagation at initialization.
- in a ResNet, the variance must still increase, but the \(k^{\text{th}}\) residual block now adds only one unit of variance, so the total grows linearly with depth rather than exponentially as \(2^k\).
- At initialization, this has a side-effect:
- later layers make a smaller change to the overall variation than earlier ones.
- the network is effectively less deep at the start of training since later layers are close to computing the identity.
- as training proceeds, the network can increase the scales \(\gamma\) in later layers and can control its own effective depth.
Warning: Higher learning rates
- Empirical studies and theory both show that batch normalization makes the loss surface and its gradient change more smoothly.
- This allows the use of higher learning rates, as the surface is more predictable.
Warning: Regularization
- Noise in the training process can improve generalization by preventing the network from overfitting to the training data.
- This is the idea behind regularization methods such as weight decay and dropout.
- Batch normalization adds noise to the training process because the statistics \(m_h\) and \(s_h\) are estimated from a random batch of training examples.
ResNet
- The ResNet-200 model contains \(200\) layers and was used for image classification on the ImageNet database.
- The resolution is decreased between adjacent ResNet blocks using convolutions with stride two.
- Channels are added by either appending zeros to the representation or applying an extra \(1 \times 1\) convolution.
- At the start of the network is a \(7 \times 7\) convolutional layer, followed by a downsampling operation.
- At the end, a fully connected layer maps the block to a vector of length \(1000\).
- passed through a softmax layer to generate class probabilities.
- The ResNet-200 model achieved a remarkable \(4.8\%\) error rate for the correct class being among the top five guesses, and \(20.1\%\) for identifying the correct class outright.
- Compare this to AlexNet (\(16.4\%\), \(38.1\%\)) and VGG (\(6.8\%\), \(23.7\%\)).
- ResNet-200 was one of the first networks to exceed human performance (\(5.1\%\) for being in the top five guesses).
- Still, this was state-of-the-art in 2016.
- Today, the best-performing model has a \(9.0\%\) error rate for identifying the class correctly.
- These newer models are now based on transformers.
DenseNet
- Residual blocks receive output from the previous layer, modify it by passing it through some network layers, and add it back to the original input.
- Alternative: concatenate the modified and the original signals.
- Increases the representation size (number of channels).
- Optional subsequent linear transformation can map back to the original size (\(1 \times 1\) convolution).
- Allows the model to add the representations together, take a weighted sum, or combine them in a more complex way.
- In practice, concatenating the representations can only be sustained for a few layers
- The number of channels (hence the number of parameters to process them) becomes increasingly large.
- Problem is alleviated by applying a \(1 \times 1\) convolution to reduce the number of channels before the next \(3 \times 3\) convolution is applied.
- In a CNN, the input is periodically downsampled.
- Concatenation across the downsampling makes no sense (different representation sizes)
- The chain of concatenation is broken at this point and a smaller representation starts a new chain.
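A shape-level NumPy sketch of a dense block (the growth rate of 12 channels per layer is an assumed value, and a \(1 \times 1\) convolution, written as a per-pixel linear map, stands in for the block's convolutions):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, c_out):
    """A 1x1 convolution: an independent linear map across channels at each pixel."""
    W = rng.normal(0, 1 / np.sqrt(x.shape[0]), (c_out, x.shape[0]))
    return np.einsum('oc,chw->ohw', W, x)

C, H, W_, growth = 16, 8, 8, 12
x = rng.normal(size=(C, H, W_))

features = [x]
for _ in range(3):
    inp = np.concatenate(features, axis=0)       # concatenate everything so far
    new = np.maximum(conv1x1(inp, growth), 0)    # stand-in for the block's convs
    features.append(new)

dense_out = np.concatenate(features, axis=0)
print(dense_out.shape)            # channels accumulate: 16 + 3*12 = (52, 8, 8)
reduced = conv1x1(dense_out, C)   # 1x1 conv maps back to a manageable size
print(reduced.shape)              # (16, 8, 8)
```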
U-Nets and hourglass networks
- In U-Nets, the encoder repeatedly downsamples the image until the receptive fields are large and information is integrated across the image.
- Then, the decoder upsamples it back to the size of the original image.
- The final output is a probability over possible object classes at each pixel.
- One drawback of this architecture is that the low-resolution representation in the middle must “remember” the high-resolution details to make the final result accurate.
- This is unnecessary if residual connections transfer the representations from the encoder to their partners in the decoder.
- Hourglass networks are similar to U-Nets but they
- apply further convolutional layers in the skip connections
- add the result back to the decoder rather than concatenating it.
- A series of these models form a stacked hourglass network
- they alternate between considering the image at local and global levels.
Important: Why residual connections help
- Batch normalization helps stabilize the forward propagation of signals through a network
- However, it has been shown to cause gradient explosion in ReLU networks without skip connections,
- with each layer increasing the magnitude of the gradients by \(\sqrt{\nicefrac{\pi}{\pi-1}} \approx 1.21\).
- This effect is also present in residual networks.
- Still, the benefit of removing the \(2^K\) increase in magnitude in the forward pass outweighs the harm of increasing the gradients by \(1.21^K\) in the backward pass.
Tip: Ghost batch normalization (GhostNorm)
- Uses only part of the batch to compute the normalization statistics
- This makes them noisier
- increasing the amount of regularization when the batch size is very large.
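A NumPy sketch of the idea (the sub-batch size is an assumed hyperparameter; each ghost sub-batch is standardized with its own statistics):

```python
import numpy as np

def ghost_batchnorm(h, ghost_size, eps=1e-5):
    """Standardize each ghost sub-batch of rows with its own statistics."""
    out = np.empty_like(h)
    for i in range(0, h.shape[0], ghost_size):
        g = h[i:i + ghost_size]
        out[i:i + ghost_size] = (g - g.mean(axis=0)) / (g.std(axis=0) + eps)
    return out

rng = np.random.default_rng(0)
h = rng.normal(2.0, 3.0, size=(256, 4))
out = ghost_batchnorm(h, ghost_size=32)
# each sub-batch of 32 now has zero mean per unit; the statistics are
# noisier than full-batch ones because they come from fewer examples
```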
- When the batch size is very small or the fluctuations within a batch are very large (e.g. NLP), the statistics in BatchNorm may become unreliable.
- Batch renormalization has been proposed to address this issue.
- It keeps track of running statistics for the mean and variance of the activations
- Another problem with batch normalization is that it is unsuitable for use in recurrent neural networks.
- These are networks for processing sequences, in which the previous output is fed back as an additional input as we move through the sequence.
- Here, the statistics must be stored at each step in the sequence
- It is unclear what to do if a test sequence is longer than the training sequence.
- Third problem: batch normalization needs access to the whole batch
- This may not be readily available when training is distributed across several machines.
Tip: Layer normalization (LayerNorm)
- LayerNorm avoids using batch statistics by normalizing each data example separately
- uses statistics gathered across the channels and spatial positions.
- there is still a separate learned scale \(\gamma\) and offset \(\delta\) per channel.
Tip: Group normalization (GroupNorm)
- GroupNorm is similar to LayerNorm but divides the channels into groups
- computes the statistics for each group separately across the within-group channels and spatial positions.
- there are still separate scale and offset parameters per channel.
Tip: Instance normalization (InstanceNorm)
- InstanceNorm takes GroupNorm to the extreme, where the number of groups \(=\) number of channels.
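The four schemes differ only in which axes the statistics are computed over. A NumPy sketch for a (batch, channels, height, width) tensor (the sizes are arbitrary, and the learned per-channel scales and offsets are omitted):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(4, 6, 5, 5))  # (batch, C, H, W)

def norm(x, axes, eps=1e-5):
    m = x.mean(axis=axes, keepdims=True)
    s = x.std(axis=axes, keepdims=True)
    return (x - m) / (s + eps)

bn = norm(x, (0, 2, 3))   # BatchNorm: per channel, over batch and spatial positions
ln = norm(x, (1, 2, 3))   # LayerNorm: per example, over channels and spatial positions
im = norm(x, (2, 3))      # InstanceNorm: per example and per channel
# GroupNorm: split the 6 channels into 2 groups of 3, normalize within each group
gn = norm(x.reshape(4, 2, 3, 5, 5), (2, 3, 4)).reshape(4, 6, 5, 5)
```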
Important: Why batch normalization helps
- BatchNorm helps control the initial gradients in a residual network.
- However, the exact mechanism by which BatchNorm improves performance is not well understood.
- It is empirically verified that adding BatchNorm has the effect of smoothing out the loss surface.
- This allows for larger learning rates.
- It was empirically shown that one can exponentially increase the learning rate schedule with BatchNorm.
- This is because BatchNorm makes the network invariant to the scales of the weight matrices.
- It is shown theoretically that for any parameter initialization, the distance to the nearest optimum is less for networks with batch normalization.
- Finally, BatchNorm has a regularizing effect due to statistical fluctuations from the random composition of the batch.