Normalization

Why

In typical neural networks, the activations of each layer can vary drastically, leading to issues like exploding or vanishing gradients, which slow down training.

  • Primary goal: stabilize the training process, mitigate internal covariate shift, and speed up convergence.
  • Secondary effect: may indirectly improve generalization (not the main purpose).

Normalization is often used to:

  • increase the speed of training convergence,
  • reduce sensitivity to variations and feature scales in input data,
  • reduce overfitting,
  • and improve the model's generalization to unseen data.

What

An operation on the data distribution: normalize the distribution of a layer's inputs or outputs to zero mean and unit variance.

Example

  • Batch Normalization: normalize over the batch dimension.
  • Layer Normalization: normalize over the features of each individual data point.
  • Others: InstanceNorm, GroupNorm, …

Comparison of Normalization Methods

| Method | Normalizes over | Typical use case | How it helps training stability |
| --- | --- | --- | --- |
| Batch Norm | each feature channel of a layer's input, across the mini-batch | CNNs and other fixed-dimension networks | mitigates covariate shift, allows larger learning rates |
| Layer Norm | all features of a single sample | RNNs / Transformers | handles variable-length sequences, stabilizes gradients |
| Weight Norm | the weight vectors | reinforcement learning, GANs | directly constrains the scale of the parameters |
| Group Norm | grouped feature channels | small-batch settings (e.g., object detection) | replaces Batch Norm when batches are too small |
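
To make the comparison concrete, here is a small PyTorch sketch (the tensor sizes and group count are arbitrary, chosen only for illustration). Each activation-normalization layer leaves the tensor shape unchanged; they differ only in the dimensions over which the statistics are computed.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 16, 16)  # (batch N, channels C, height H, width W)

# Batch Norm: per-channel statistics, computed over (N, H, W)
bn = nn.BatchNorm2d(num_features=32)

# Layer Norm: per-sample statistics, computed over the trailing (C, H, W) dims
ln = nn.LayerNorm(normalized_shape=[32, 16, 16])

# Group Norm: per-sample statistics over groups of channels (here 8 groups of 4)
gn = nn.GroupNorm(num_groups=8, num_channels=32)

for name, layer in [("BatchNorm", bn), ("LayerNorm", ln), ("GroupNorm", gn)]:
    print(name, layer(x).shape)  # shape is unchanged; only the statistics differ

# Weight Norm reparametrizes a layer's weights instead of normalizing activations
wn_conv = nn.utils.weight_norm(nn.Conv2d(32, 32, kernel_size=3, padding=1))
print("WeightNorm conv", wn_conv(x).shape)
```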

How

Batch Normalization

The normalization process computes the mean and variance of each feature over a mini-batch and then scales and shifts the features using these statistics; a code sketch follows the three steps below.

  1. Compute the Mean and Variance of the Mini-Batch: For a mini-batch of activations $x_1, x_2, \dots, x_m$, the mean $\mu_B$ and the variance $\sigma_B^2$ of the mini-batch are computed.

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$$

  2. Normalization: Each activation $x_i$ is normalized using the computed mean and variance of the mini-batch.

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

  3. Scale and Shift the Normalized Activations: $\gamma$ and $\beta$ are learnable parameters that allow optimal scaling and shifting of the normalized activations, giving the network additional flexibility.

$$y_i = \gamma \hat{x}_i + \beta$$
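
A minimal NumPy sketch of the three training-time steps above (the function name and ε value are illustrative; a real implementation would also keep running estimates of the mean and variance for use at inference):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch of m activations, each with d features."""
    mu_B = x.mean(axis=0)                      # step 1: per-feature mean over the batch
    var_B = x.var(axis=0)                      # step 1: per-feature (biased) variance
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)  # step 2: normalize
    return gamma * x_hat + beta                # step 3: scale and shift (learnable)

x = np.random.randn(64, 10) * 3.0 + 5.0        # m = 64 samples, d = 10 features
y = batch_norm_train(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ≈ 0 and ≈ 1 per feature
```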

Layer Normalization

The normalization process computes the mean and variance over the features of each individual sample and then scales and shifts the features using these statistics; a code sketch follows the steps below.

  1. Compute the Mean and Variance over the Features: The mean and variance are computed for each individual sample across its features ($H$ is the number of features (neurons) in the layer).

$$\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}\left(x_i - \mu\right)^2$$

  2. Normalize the Input:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

  3. Apply Scaling and Shifting:

$$y_i = \gamma \hat{x}_i + \beta$$
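
The corresponding sketch for Layer Normalization. The only change from the Batch Norm sketch is the axis over which the statistics are taken (per sample, across its H features), which is why it does not depend on the batch size:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (m, H) batch of m samples, each with H features."""
    mu = x.mean(axis=-1, keepdims=True)    # step 1: mean over each sample's features
    var = x.var(axis=-1, keepdims=True)    # step 1: variance over each sample's features
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 2: normalize
    return gamma * x_hat + beta            # step 3: scale and shift (learnable)

x = np.random.randn(4, 512)                # works for any batch size, even m = 1
y = layer_norm(x, gamma=np.ones(512), beta=np.zeros(512))
print(y.mean(axis=-1).round(3), y.std(axis=-1).round(3))  # ≈ 0 and ≈ 1 per sample
```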

Caveats

  1. Not a cure-all
    • Normalization can hurt performance on some tasks (e.g., the constraint it places on the output distribution in generative models).
    • At inference time the statistics must be fixed (e.g., Batch Norm switches to its global running mean and variance); see the sketch after this list.
  2. Works together with initialization
    Normalization is usually combined with Xavier/Kaiming initialization to jointly control the scale of signal propagation.
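
For the inference-time caveat above, a brief PyTorch illustration (the layer size and inputs are arbitrary): switching the module to eval() makes Batch Norm stop using the current batch statistics and fall back to the running mean and variance accumulated during training.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=10)

bn.train()                            # training mode: uses batch statistics and
for _ in range(100):                  # updates the running estimates
    _ = bn(torch.randn(32, 10) * 2.0 + 1.0)

bn.eval()                             # inference mode: uses the stored running statistics
out = bn(torch.randn(1, 10))          # even a batch of one is fine in eval mode
print(bn.running_mean)                # ≈ 1 for every feature
print(bn.running_var)                 # ≈ 4 for every feature (std 2, squared)
```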

Reference

https://en.wikipedia.org/wiki/Normalization_(machine_learning)
https://www.geeksforgeeks.org/what-is-batch-normalization-in-deep-learning/
https://www.geeksforgeeks.org/what-is-layer-normalization/