Residual Networks
- Each network layer computes an additive change to the current representation, instead of transforming it directly.
- Allows deeper networks to be trained.
- However, residual connections cause an exponential increase in the activation magnitudes at initialization.
- Various normalization methods are used to control the magnitude of the activations.
- Residual blocks with batch normalization allow much deeper networks to be trained.
Image credits: Understanding Deep Learning by Simon J. D. Prince, [CC BY 4.0]
- A three-layer network is defined by \[ \begin{aligned} \bm{h}_1 &= f_1(\bm{x}; \bm{\phi}_1) \\ \bm{h}_2 &= f_2(\bm{h}_1; \bm{\phi}_2) \\ \bm{h}_3 &= f_3(\bm{h}_2; \bm{\phi}_3) \\ \bm{y} &= f_4(\bm{h}_3; \bm{\phi}_4) \end{aligned} \]
- Since the processing is sequential, we can equivalently think of this network as a series of nested functions \[ \bm{y} = f_4\left(f_3\left(f_2\left(f_1(\bm{x}; \bm{\phi}_1); \bm{\phi}_2\right); \bm{\phi}_3\right); \bm{\phi}_4\right) \]
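This equivalence is easy to check numerically. A minimal NumPy sketch, with hypothetical random ReLU layers standing in for the \(f_k\), shows that evaluating layer by layer gives the same output as the nested composition:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
# hypothetical layers: each f_k is a random linear map followed by a ReLU
Ws = [rng.normal(0, 1 / np.sqrt(D), (D, D)) for _ in range(4)]

def f(k, h):
    return np.maximum(Ws[k] @ h, 0)

x = rng.normal(size=D)

# evaluate layer by layer ...
h1 = f(0, x)
h2 = f(1, h1)
h3 = f(2, h2)
y = f(3, h3)

# ... which equals the nested composition f4(f3(f2(f1(x))))
assert np.allclose(y, f(3, f(2, f(1, f(0, x)))))
```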
- In principle, we can add as many layers as we want
- But in practice, the performance saturates and then degrades
- Indeed, the degradation is present for both the training and test set.
- This implies that the problem lies in training deeper networks,
- rather than in the inability of deeper networks to generalize.
- For a shallow network, the gradient of the output with respect to the input changes slowly as we change the input.
- However, for a deep network, a tiny change in the input results in a completely different gradient.
- This is captured by the autocorrelation function of the gradient.
- Nearby gradients are correlated for shallow networks, but this correlation quickly drops to zero for deep networks.
- This is termed the shattered gradients phenomenon.
- Shattered gradients presumably arise because changes in early network layers modify the output in an increasingly complex way as the network becomes deeper.
- The derivative of the output with respect to the first layer \(f_1\) of the network is \[ \pd{\bm{y}}{\bm{f}_1} = \pd{\bm{f}_4}{\bm{f}_3} \pd{\bm{f}_3}{\bm{f}_2} \pd{\bm{f}_2}{\bm{f}_1} \]
- When we change the parameters that determine \(\bm{f}_1\):
- all of the derivatives in this sequence are evaluated at slightly different locations
- since layers \(\bm{f}_2, \bm{f}_3, \bm{f}_4\) depend on \(\bm{f}_1\)
- Consequently, the updated gradient at each training example may be completely different, and the loss function becomes badly behaved.
Residual connections and residual blocks
- Residual or skip connections are branches in the computational path, whereby the input to each network layer \(\bm{f}(\cdot)\) is added back to the output \[ \begin{aligned} \bm{h}_1 &= \bm{x} + \bm{f}_1(\bm{x}; \bm{\phi}_1) \\ \bm{h}_2 &= \bm{h}_1 + \bm{f}_2(\bm{h}_1; \bm{\phi}_2) \\ \bm{h}_3 &= \bm{h}_2 + \bm{f}_3(\bm{h}_2; \bm{\phi}_3) \\ \bm{y} &= \bm{h}_3 + \bm{f}_4(\bm{h}_3; \bm{\phi}_4), \end{aligned} \] where the first term on the right-hand side of each line is the residual connection.
- Each function \(\bm{f}_k\) learns an additive change to the current representation.
- It follows that their outputs must be the same size as their inputs.
- Each additive combination of the input and the processed output is known as a residual block or residual layer.
- If we write this as a single function by substituting in the expressions for the intermediate quantities \(\bm{h}_k\): \[ \begin{aligned} \bm{y} = \bm{x} &+ \bm{f}_1(\bm{x}) \\ &+ \bm{f}_2\left(\bm{x} + \bm{f}_1(\bm{x})\right) \\ &+ \bm{f}_3\left(\bm{x} + \bm{f}_1(\bm{x}) + \bm{f}_2\left(\bm{x} + \bm{f}_1(\bm{x})\right)\right), \\ &+ \bm{f}_4\left(\bm{x} + \bm{f}_1(\bm{x}) + \bm{f}_2\left(\bm{x} + \bm{f}_1(\bm{x})\right) + \bm{f}_3\left(\bm{x} + \bm{f}_1(\bm{x}) + \bm{f}_2\left(\bm{x} + \bm{f}_1(\bm{x})\right)\right)\right), \end{aligned} \tag{1}\] where we have omitted the parameters \(\bm{\phi}_k\) for brevity.
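The decomposition above can be verified numerically. In the NumPy sketch below (with hypothetical ReLU branches standing in for the \(\bm{f}_k\)), the output equals the input plus the sum of the four branch outputs, each evaluated at the representation it received:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8
Ws = [rng.normal(0, 1 / np.sqrt(D), (D, D)) for _ in range(4)]

def f(k, h):
    # hypothetical residual branch: a small nonlinear transformation
    return np.maximum(Ws[k] @ h, 0)

x = rng.normal(size=D)
h, branches = x, []
for k in range(4):
    b = f(k, h)        # branch output, evaluated at the current representation
    branches.append(b)
    h = h + b          # residual connection: add the branch back to its input
y = h

# the output is the input plus the sum of the four branch outputs
assert np.allclose(y, x + sum(branches))
```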
- The final network output is a sum of the input and four smaller networks corresponding to each line of the equation
- One interpretation is that residual connections turn the original network into an ensemble of these smaller networks.
- These are summed to compute the result.
- A complementary way of thinking is that residual connections create sixteen paths with differing numbers of transformations between input and output.
- The first function \(\bm{f}_1(\bm{x})\) occurs in eight of these sixteen paths, including as a direct additive term. \[ \pd{\bm{y}}{\bm{f}_1} = \bm{I} + \pd{\bm{f}_2}{\bm{f}_1} + \left( \pd{\bm{f}_3}{\bm{f}_1} + \pd{\bm{f}_2}{\bm{f}_1} \pd{\bm{f}_3}{\bm{f}_2} \right) + \left( \pd{\bm{f}_4}{\bm{f}_1} + \pd{\bm{f}_2}{\bm{f}_1}\pd{\bm{f}_4}{\bm{f}_2} + \pd{\bm{f}_3}{\bm{f}_1}\pd{\bm{f}_4}{\bm{f}_3} + \pd{\bm{f}_2}{\bm{f}_1} \pd{\bm{f}_3}{\bm{f}_2} \pd{\bm{f}_4}{\bm{f}_3} \right) \] where there is one term for each of the eight paths.
- The identity term on the right-hand side shows that changes in the parameters \(\bm{\phi}_1\) in the first layer \(\bm{f}_1(\bm{x}; \bm{\phi}_1)\) contribute directly to changes in the network output \(\bm{y}\).
- They also contribute indirectly through the other chains of derivatives of varying lengths.
- In general, gradients through shorter paths will be better behaved.
- Since both the identity term and various short chains of derivatives will contribute to the derivative for each layer, networks with residual links suffer less from shattered gradients.
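The contrast can be illustrated with a small NumPy experiment (a sketch with hypothetical random ReLU layers; the width, depth, and perturbation size are assumed values). Manual backpropagation computes the gradient of the summed output with respect to the input, once for a plain network and once for the residual version, at two nearby inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 32, 24   # width and depth (assumed for illustration)

# He-style initialization for the ReLU layers
Ws = [rng.normal(0, np.sqrt(2 / D), (D, D)) for _ in range(K)]

def grad_plain(x):
    """Gradient of sum(output) w.r.t. x for a plain network h <- relu(W h)."""
    h, masks = x, []
    for W in Ws:
        a = W @ h
        masks.append(a > 0)
        h = np.maximum(a, 0)
    g = np.ones(D)
    for W, m in zip(reversed(Ws), reversed(masks)):
        g = W.T @ (g * m)       # backprop through ReLU, then the linear map
    return g

def grad_residual(x):
    """Same, but for residual blocks h <- h + relu(W h)."""
    h, masks = x, []
    for W in Ws:
        a = W @ h
        masks.append(a > 0)
        h = h + np.maximum(a, 0)
    g = np.ones(D)
    for W, m in zip(reversed(Ws), reversed(masks)):
        g = g + W.T @ (g * m)   # identity path plus the branch path
    return g

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

x = rng.normal(size=D)
dx = 0.1 * rng.normal(size=D)
c_plain = cosine(grad_plain(x), grad_plain(x + dx))
c_res = cosine(grad_residual(x), grad_residual(x + dx))
# compare the two: residual gradients typically remain far more correlated
print(c_plain, c_res)
```

Typically, the plain network's gradients decorrelate quickly for nearby inputs, while the residual network's gradients remain well correlated.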
- If the ReLU function is the final operation in a residual block, the branch output is nonnegative.
- Therefore, each residual block can only increase the input values.
- It is typical to change the order of operations so that the activation function is applied first,
- followed by a linear transformation.
- There may also be several layers of processing within the residual block;
- these often terminate with a linear transformation.
Tip: How deep can we make networks?
- Adding residual connections roughly doubles the depth of a network that can be practically trained before performance degrades.
- We still want to increase the depth further
- To understand why residual connections do not allow us to increase the depth arbitrarily, consider:
- how the variance of the activations changes during the forward pass
- how the gradient magnitudes change during the backward pass
Important: Exploding gradients in residual networks
- Recall that bad weight initialization may lead to
- Exponentially increasing activation magnitudes during the forward pass
- Exponentially increasing gradient magnitudes during the backward pass
- We initialize the network parameters so that
- the activations (in the forward pass)
- and the gradients (in the backward pass)
- remain the same between layers.
- In residual networks, the activation magnitudes in the forward pass increase exponentially as we move through the network.
- Each residual block adds its processed branch back to the input.
- If the branch is initialized to have roughly the same variance as the input, recombining them doubles the variance.
- Hence, the variance grows exponentially with the number of residual blocks.
- This limits the possible network depth before floating-point precision is exceeded in the forward pass.
- A similar argument applies to the gradients in the backward pass of the backpropagation algorithm.
- Hence, residual networks still suffer from unstable forward propagation and exploding gradients, even with He initialization.
- One way to stabilize the forward and backward passes would be to use He initialization
- and then multiply the combined output of the residual block by \(\nicefrac{1}{\sqrt{2}}\) to prevent the variance from doubling.
- More usual: use batch normalization.
Batch Normalization
- BatchNorm shifts and rescales each activation \(h\) so that its mean and variance across the batch \(\mc{B}\) becomes values that are learned during training. \[ \begin{aligned} m_h &= \frac{1}{|\mc{B}|} \sum_{i \in \mc{B}} h_i \\ s_h &= \sqrt{\frac{1}{|\mc{B}|} \sum_{i \in \mc{B}} (h_i - m_h)^2}, \end{aligned} \] where all quantities are scalars.
- Then, we use these statistics to standardize the batch activations to have mean zero and unit variance: \[ h_i \leftarrow \frac{h_i - m_h}{s_h + \epsilon}, \qquad \forall i \in \mc{B}, \tag{2}\] where \(\epsilon\) is a small number that prevents division by zero if \(h_i\) is the same for every member of the batch and \(s_h = 0\).
- Finally, the normalized variable is scaled by \(\gamma\) and shifted by \(\delta\): \[ h_i \leftarrow \gamma h_i + \delta, \qquad \forall i \in \mc{B}. \]
- After this operation, the activations have mean \(\delta\) and standard deviation \(\gamma\) across all members of the batch.
- Both of these quantities are learned during training.
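The three steps above can be sketched directly in NumPy (a minimal per-unit version; in a real network the operation is applied at every hidden unit of every normalized layer):

```python
import numpy as np

def batchnorm(h, gamma, delta, eps=1e-5):
    """Normalize each hidden unit (column) across the batch, then scale and shift."""
    m_h = h.mean(axis=0)           # batch mean per unit
    s_h = h.std(axis=0)            # batch standard deviation per unit
    h = (h - m_h) / (s_h + eps)    # standardize to zero mean, unit variance
    return gamma * h + delta       # learned scale gamma and offset delta

rng = np.random.default_rng(0)
h = rng.normal(3.0, 2.0, size=(128, 4))   # batch of 128 examples, 4 hidden units
out = batchnorm(h, gamma=1.5, delta=0.5)
# across the batch, each unit now has mean delta and std (roughly) gamma
```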
Note: Batch normalization
- Batch normalization is applied independently to each hidden unit.
- In a standard neural net with \(K\) layers, each containing \(D\) hidden units, there would be \(KD\) learned offsets \(\delta\) and \(KD\) learned scales \(\gamma\).
- In a convolutional network, the normalizing statistics are computed over both the batch and the spatial positions.
- If there were \(K\) layers, each containing \(C\) channels, there would be \(KC\) learned offsets and \(KC\) learned scales.
- At test time, we do not have a batch from which we can gather statistics.
- To resolve this, the statistics \(m_h\) and \(s_h\) are calculated across the whole training dataset and frozen in the final network.
Costs and benefits of batch normalization
- Batch normalization makes the network invariant to rescaling the weights and biases that contribute to each activation
- if these are doubled, then the activations also double
- the estimated standard deviation \(s_h\) also doubles, and the normalization in Equation 2 compensates for these changes.
- Since this happens for each hidden unit, there will be a large family of weights and biases that all produce the same effect.
- BatchNorm adds two parameters \(\gamma\) and \(\delta\) per hidden unit, which makes the model somewhat larger.
- Hence, it both creates redundancy in the weights and biases
- and adds extra parameters to compensate for that redundancy.
- This is obviously inefficient, but BatchNorm also provides several benefits.
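The rescaling invariance is easy to verify numerically: doubling the pre-activations also doubles the batch statistics, so the standardized output is (up to the small \(\epsilon\)) unchanged. A sketch:

```python
import numpy as np

def standardize(h, eps=1e-5):
    return (h - h.mean(axis=0)) / (h.std(axis=0) + eps)

rng = np.random.default_rng(0)
h = rng.normal(size=(64, 3))   # pre-activations for a batch of 64, 3 units

# doubling the weights/biases feeding these activations doubles h,
# but the batch mean and std double too, so normalization compensates
assert np.allclose(standardize(h), standardize(2 * h), atol=1e-4)
```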
Warning: Stable forward propagation
- If we initialize the offsets \(\delta\) to zero and the scales \(\gamma\) to one,
- then each output activation will have unit variance.
- in a regular neural net, this ensures the variance is stable during forward propagation at initialization.
- in a ResNet, the variance must still increase, but the \(k^{\text{th}}\) residual block now adds only one unit of variance, so the total grows linearly with depth rather than exponentially as \(2^k\).
- At initialization, this has a side-effect:
- later layers make a smaller change to the overall variation than earlier ones.
- the network is effectively less deep at the start of training since later layers are close to computing the identity.
- as training proceeds, the network can increase the scales \(\gamma\) in later layers and can control its own effective depth.
Warning: Higher learning rates
- Empirical studies and theory both show that batch normalization makes the loss surface and its gradient change more smoothly.
- This allows the use of higher learning rates, as the surface is more predictable.
Warning: Regularization
- Noise in the training process can improve generalization by preventing the network from overfitting to the training data.
- This is the idea behind regularization methods such as weight decay and dropout.
- Batch normalization adds noise to the training process because the statistics \(m_h\) and \(s_h\) are estimated from a random batch of training examples.
ResNet
- The ResNet-200 model contains \(200\) layers and was used for image classification on the ImageNet database.
- The resolution is decreased between adjacent ResNet blocks using convolutions with stride two.
- Channels are added by either appending zeros to the representation or applying an extra \(1 \times 1\) convolution.
- At the start of the network is a \(7 \times 7\) convolutional layer, followed by a downsampling operation.
- At the end, a fully connected layer maps the block to a vector of length \(1000\).
- passed through a softmax layer to generate class probabilities.
- The ResNet-200 model achieved a remarkable \(4.8\%\) error rate for the correct class being among the top five guesses, and \(20.1\%\) for identifying the correct class outright.
- Compare this to AlexNet (\(16.4\%\), \(38.1\%\)) and VGG (\(6.8\%\), \(23.7\%\)).
- ResNet-200 was one of the first networks to exceed human performance (\(5.1\%\) for being in the top five guesses).
- Still, this was state-of-the-art in 2016.
- Today, the best-performing model has a \(9.0\%\) error rate for identifying the class correctly.
- These newer models are now based on transformers.
DenseNet
- Residual blocks receive output from the previous layer, modify it by passing it through some network layers, and add it back to the original input.
- Alternative: concatenate the modified and the original signals.
- Increases the representation size (number of channels).
- Optional subsequent linear transformation can map back to the original size (\(1 \times 1\) convolution).
- Allows the model to add the representations together, take a weighted sum, or combine them in a more complex way.
- In practice, concatenating the representations can only be sustained for a few layers
- The number of channels (hence the number of parameters to process them) becomes increasingly large.
- Problem is alleviated by applying a \(1 \times 1\) convolution to reduce the number of channels before the next \(3 \times 3\) convolution is applied.
- In a CNN, the input is periodically downsampled.
- Concatenation across the downsampling makes no sense (different representation sizes)
- The chain of concatenation is broken at this point and a smaller representation starts a new chain.
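A shape-level NumPy sketch of a dense block (the growth rate of 12 channels per layer is an assumed value, and a \(1 \times 1\) convolution, written as a per-pixel linear map, stands in for the block's convolutions):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, c_out):
    """A 1x1 convolution: an independent linear map across channels at each pixel."""
    W = rng.normal(0, 1 / np.sqrt(x.shape[0]), (c_out, x.shape[0]))
    return np.einsum('oc,chw->ohw', W, x)

C, H, W_, growth = 16, 8, 8, 12
x = rng.normal(size=(C, H, W_))

features = [x]
for _ in range(3):
    inp = np.concatenate(features, axis=0)       # concatenate everything so far
    new = np.maximum(conv1x1(inp, growth), 0)    # stand-in for the block's convs
    features.append(new)

dense_out = np.concatenate(features, axis=0)
print(dense_out.shape)            # channels accumulate: 16 + 3*12 = (52, 8, 8)
reduced = conv1x1(dense_out, C)   # 1x1 conv maps back to a manageable size
print(reduced.shape)              # (16, 8, 8)
```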
U-Nets and hourglass networks
- In U-Nets, the encoder repeatedly downsamples the image until the receptive fields are large and information is integrated across the image.
- Then, the decoder upsamples it back to the size of the original image.
- The final output is a probability over possible object classes at each pixel.
- One drawback of this architecture is that the low-resolution representation in the middle must “remember” the high-resolution details to make the final result accurate.
- This is unnecessary if residual connections transfer the representations from the encoder to their partners in the decoder.
- Hourglass networks are similar to U-Nets but they
- apply further convolutional layers in the skip connections
- add the result back to the decoder rather than concatenating it.
- A series of these models form a stacked hourglass network
- they alternate between considering the image at local and global levels.
Important: Why residual connections help
- Batch normalization helps stabilize the forward propagation of signals through a network
- However, it has been shown to cause gradient explosion in ReLU networks without skip connections,
- with each layer increasing the magnitude of the gradients by \(\sqrt{\nicefrac{\pi}{\pi-1}} \approx 1.21\).
- This effect is also present in residual networks.
- Still, the benefit of removing the \(2^K\) increase in magnitude in the forward pass outweighs the harm of increasing the gradients by \(1.21^K\) in the backward pass.
Tip: Ghost batch normalization (GhostNorm)
- Uses only part of the batch to compute the normalization statistics
- This makes them noisier
- increasing the amount of regularization when the batch size is very large.
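A NumPy sketch of the idea (the sub-batch size is an assumed hyperparameter; each ghost sub-batch is standardized with its own statistics):

```python
import numpy as np

def ghost_batchnorm(h, ghost_size, eps=1e-5):
    """Standardize each ghost sub-batch of rows with its own statistics."""
    out = np.empty_like(h)
    for i in range(0, h.shape[0], ghost_size):
        g = h[i:i + ghost_size]
        out[i:i + ghost_size] = (g - g.mean(axis=0)) / (g.std(axis=0) + eps)
    return out

rng = np.random.default_rng(0)
h = rng.normal(2.0, 3.0, size=(256, 4))
out = ghost_batchnorm(h, ghost_size=32)
# each sub-batch of 32 now has zero mean per unit; the statistics are
# noisier than full-batch ones because they come from fewer examples
```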
- When the batch size is very small or the fluctuations within a batch are very large (e.g. NLP), the statistics in BatchNorm may become unreliable.
- Batch renormalization has been proposed to address this issue.
- It keeps track of running statistics for the mean and variance of the activations
- Another problem with batch normalization is that it is unsuitable for use in recurrent neural networks.
- These are networks for processing sequences, in which the previous output is fed back as an additional input as we move through the sequence.
- Here, the statistics must be stored at each step in the sequence
- It is unclear what to do if a test sequence is longer than the training sequence.
- Third problem: batch normalization needs access to the whole batch
- This may not be readily available when training is distributed across several machines.
Tip: Layer normalization (LayerNorm)
- LayerNorm avoids using batch statistics by normalizing each data example separately
- uses statistics gathered across the channels and spatial positions.
- there is still a separate learned scale \(\gamma\) and offset \(\delta\) per channel.
Tip: Group normalization (GroupNorm)
- GroupNorm is similar to LayerNorm but divides the channels into groups
- computes the statistics for each group separately across the within-group channels and spatial positions.
- there are still separate scale and offset parameters per channel.
Tip: Instance normalization (InstanceNorm)
- InstanceNorm takes GroupNorm to the extreme, where the number of groups \(=\) number of channels.
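The four schemes differ only in which axes the statistics are computed over. A NumPy sketch for a (batch, channels, height, width) tensor (the sizes are arbitrary, and the learned per-channel scales and offsets are omitted):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=(4, 6, 5, 5))  # (batch, C, H, W)

def norm(x, axes, eps=1e-5):
    m = x.mean(axis=axes, keepdims=True)
    s = x.std(axis=axes, keepdims=True)
    return (x - m) / (s + eps)

bn = norm(x, (0, 2, 3))   # BatchNorm: per channel, over batch and spatial positions
ln = norm(x, (1, 2, 3))   # LayerNorm: per example, over channels and spatial positions
im = norm(x, (2, 3))      # InstanceNorm: per example and per channel
# GroupNorm: split the 6 channels into 2 groups of 3, normalize within each group
gn = norm(x.reshape(4, 2, 3, 5, 5), (2, 3, 4)).reshape(4, 6, 5, 5)
```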
Important: Why batch normalization helps
- BatchNorm helps control the initial gradients in a residual network.
- However, the exact mechanism by which BatchNorm improves performance is not well understood.
- It is empirically verified that adding BatchNorm has the effect of smoothing out the loss surface.
- This allows for larger learning rates.
- It was empirically shown that one can exponentially increase the learning rate schedule with BatchNorm.
- This is because BatchNorm makes the network invariant to the scales of the weight matrices.
- It is shown theoretically that for any parameter initialization, the distance to the nearest optimum is less for networks with batch normalization.
- Finally, BatchNorm has a regularizing effect due to statistical fluctuations from the random composition of the batch.