Normalization

Why

In typical neural networks, the activations of each layer can vary drastically, leading to issues like exploding or vanishing gradients, which slow down training.

  • Primary goal: stabilize the training process, mitigate internal covariate shift, and speed up convergence.
  • Secondary effect: may indirectly improve generalization (not the main purpose).

Normalization is often used to:

  • increase the speed of training convergence,
  • reduce sensitivity to variations and feature scales in input data,
  • reduce overfitting,
  • and improve the model's generalization to unseen data.

What

An operation on the data distribution: normalize the distribution of a layer's inputs or outputs to zero mean and unit variance.

Example

  • Batch Normalization: normalize over the batch dimension.
  • Layer Normalization: normalize over the features of each individual data point.
  • Others: InstanceNorm, GroupNorm, …

Comparison of Normalization Methods

| Method | Normalizes over | Typical use case | How it helps training stability |
| --- | --- | --- | --- |
| Batch Norm | each feature channel of a layer's input, across the mini-batch | CNNs and other fixed-dimension networks | mitigates covariate shift, allows larger learning rates |
| Layer Norm | all features of a single sample | RNNs / Transformers | handles variable-length sequences, stabilizes gradients |
| Weight Norm | the weight vectors | reinforcement learning, GANs | directly constrains the scale of the parameters |
| Group Norm | grouped feature channels | small-batch settings (e.g., object detection) | replaces Batch Norm when batches are too small |
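
To make the comparison concrete, here is a small PyTorch sketch (the tensor sizes and group count are arbitrary, chosen only for illustration). Each activation-normalization layer leaves the tensor shape unchanged; they differ only in the dimensions over which the statistics are computed.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 16, 16)  # (batch N, channels C, height H, width W)

# Batch Norm: per-channel statistics, computed over (N, H, W)
bn = nn.BatchNorm2d(num_features=32)

# Layer Norm: per-sample statistics, computed over the trailing (C, H, W) dims
ln = nn.LayerNorm(normalized_shape=[32, 16, 16])

# Group Norm: per-sample statistics over groups of channels (here 8 groups of 4)
gn = nn.GroupNorm(num_groups=8, num_channels=32)

for name, layer in [("BatchNorm", bn), ("LayerNorm", ln), ("GroupNorm", gn)]:
    print(name, layer(x).shape)  # shape is unchanged; only the statistics differ

# Weight Norm reparametrizes a layer's weights instead of normalizing activations
wn_conv = nn.utils.weight_norm(nn.Conv2d(32, 32, kernel_size=3, padding=1))
print("WeightNorm conv", wn_conv(x).shape)
```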

How

Batch Normalization

The normalization process computes the mean and variance of each feature over a mini-batch and then scales and shifts the features using these statistics; a code sketch follows the three steps below.

  1. Compute the Mean and Variance of the Mini-Batch: For a mini-batch of activations $x_1, x_2, \dots, x_m$, the mean $\mu_B$ and the variance $\sigma_B^2$ of the mini-batch are computed.

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$$

  2. Normalization: Each activation $x_i$ is normalized using the computed mean and variance of the mini-batch.

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

  3. Scale and Shift the Normalized Activations: $\gamma$ and $\beta$ are learnable parameters that allow optimal scaling and shifting of the normalized activations, giving the network additional flexibility.

$$y_i = \gamma \hat{x}_i + \beta$$
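
A minimal NumPy sketch of the three training-time steps above (the function name and ε value are illustrative; a real implementation would also keep running estimates of the mean and variance for use at inference):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (m, d) mini-batch of m activations, each with d features."""
    mu_B = x.mean(axis=0)                      # step 1: per-feature mean over the batch
    var_B = x.var(axis=0)                      # step 1: per-feature (biased) variance
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)  # step 2: normalize
    return gamma * x_hat + beta                # step 3: scale and shift (learnable)

x = np.random.randn(64, 10) * 3.0 + 5.0        # m = 64 samples, d = 10 features
y = batch_norm_train(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ≈ 0 and ≈ 1 per feature
```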

Layer Normalization

The normalization process computes the mean and variance over the features of each individual sample and then scales and shifts the features using these statistics; a code sketch follows the steps below.

  1. Compute the Mean and Variance over the Features: The mean and variance are computed for each individual sample across its features ($H$ is the number of features (neurons) in the layer).

$$\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}\left(x_i - \mu\right)^2$$

  2. Normalize the Input:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

  3. Apply Scaling and Shifting:

$$y_i = \gamma \hat{x}_i + \beta$$
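
The corresponding sketch for Layer Normalization. The only change from the Batch Norm sketch is the axis over which the statistics are taken (per sample, across its H features), which is why it does not depend on the batch size:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (m, H) batch of m samples, each with H features."""
    mu = x.mean(axis=-1, keepdims=True)    # step 1: mean over each sample's features
    var = x.var(axis=-1, keepdims=True)    # step 1: variance over each sample's features
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 2: normalize
    return gamma * x_hat + beta            # step 3: scale and shift (learnable)

x = np.random.randn(4, 512)                # works for any batch size, even m = 1
y = layer_norm(x, gamma=np.ones(512), beta=np.zeros(512))
print(y.mean(axis=-1).round(3), y.std(axis=-1).round(3))  # ≈ 0 and ≈ 1 per sample
```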

Caveats

  1. Not a cure-all
    • Normalization can hurt performance on some tasks (e.g., the constraint it places on the output distribution in generative models).
    • At inference time the statistics must be fixed (e.g., Batch Norm switches to its global running mean and variance); see the sketch after this list.
  2. Works together with initialization
    Normalization is usually combined with Xavier/Kaiming initialization to jointly control the scale of signal propagation.
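
For the inference-time caveat above, a brief PyTorch illustration (the layer size and inputs are arbitrary): switching the module to eval() makes Batch Norm stop using the current batch statistics and fall back to the running mean and variance accumulated during training.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=10)

bn.train()                            # training mode: uses batch statistics and
for _ in range(100):                  # updates the running estimates
    _ = bn(torch.randn(32, 10) * 2.0 + 1.0)

bn.eval()                             # inference mode: uses the stored running statistics
out = bn(torch.randn(1, 10))          # even a batch of one is fine in eval mode
print(bn.running_mean)                # ≈ 1 for every feature
print(bn.running_var)                 # ≈ 4 for every feature (std 2, squared)
```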

Reference

https://en.wikipedia.org/wiki/Normalization_(machine_learning)
https://www.geeksforgeeks.org/what-is-batch-normalization-in-deep-learning/
https://www.geeksforgeeks.org/what-is-layer-normalization/