Normalization

Exisfar 2025-04-18 2025-05-27

Why

In typical neural networks, the activations of each layer can vary drastically, leading to issues like exploding or vanishing gradients, which slow down training.

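Primary goal: stabilize training, mitigate Internal Covariate Shift, and accelerate convergence.
Secondary effect: may indirectly improve generalization (not the main purpose).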

Normalization is often used to:

increase the speed of training convergence,
reduce sensitivity to variations and feature scales in input data,
reduce overfitting,
and produce better model generalization to unseen data.

What

An operation on the data distribution: normalize the distribution of a layer's inputs or outputs to zero mean and unit variance.
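As a minimal sketch of "zero mean, unit variance", assuming a plain list of activations and a small `eps` added for numerical safety (both are illustrative choices, not part of the definition above):

```python
import math

def standardize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    n = len(values)
    mean = sum(values) / n
    # Population variance (divide by n), matching the formulas used in these notes.
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]

activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical layer outputs
normalized = standardize(activations)

new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
```

Here `new_mean` comes out at essentially zero and `new_var` just under one; the small gap is caused by `eps`.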

Example

Batch Normalization: normalize over the batch dimension.
Layer Normalization: normalize over the features for each individual data point.
Others: InstanceNorm, GroupNorm, …

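Comparison of normalization methods

Method	Operates on	Typical use	How it helps training stability
Batch Norm	Each feature channel of a layer's input	CNNs and other fixed-dimension networks	Addresses covariate shift, allows larger learning rates
Layer Norm	All features of a single sample	RNN/Transformer	Handles variable-length sequences, stabilizes gradients
Weight Norm	Weight vectors	Reinforcement learning, GANs	Directly constrains parameter scale
Group Norm	Feature channels split into groups	Small-batch settings (e.g. object detection)	Replaces Batch Norm to avoid small-batch problems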
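Weight Norm is the one row in the table that acts on parameters rather than activations. A rough sketch, assuming the usual reparameterization $w = g \cdot v / \lVert v \rVert$ (the names `v` and `g` and the sample values are illustrative, not taken from these notes):

```python
import math

def weight_norm(v, g):
    """Rebuild a weight vector from a direction v and a scalar magnitude g."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]

v = [3.0, 4.0]  # direction parameter (hypothetical values)
g = 2.0         # magnitude parameter (hypothetical value)
w = weight_norm(v, g)

# The length of w is controlled by g alone, whatever the scale of v.
w_length = math.sqrt(sum(x * x for x in w))
```

Because the length of `w` always equals `g`, the scale of the weights is a single explicit parameter, which is what "directly constrains parameter scale" refers to.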

How

Batch Normalization

The normalization process involves calculating the mean and variance of each feature in a mini-batch and then scaling and shifting the features using these statistics.

Compute the Mean and Variance of Mini-Batches: For a mini-batch of activations $x_1, x_2, \dots, x_m$, the mean $\mu_B$ and the variance $\sigma_B^2$ of the mini-batch are computed.

$\mu_B = \frac{1}{m}\sum_{i=1}^m x_i \\ \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2$

Normalization: Each activation $x_i$ is normalized using the computed mean and variance of the mini-batch.

$\hat{x_i} = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

Scale and Shift the Normalized Activations: $\gamma$ and $\beta$ are learnable parameters, which allow the optimal scaling and shifting of the normalized activations, giving the network additional flexibility.

$y_i = \gamma \hat{x_i} + \beta$
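The three steps above can be sketched in plain Python. This is a training-mode illustration only, assuming the batch is a list of equal-length feature vectors and that `gamma` and `beta` hold one value per feature; the function name and sample data are hypothetical.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Batch norm over a list of m samples, each a list of d features."""
    m, d = len(batch), len(batch[0])
    # Step 1: per-feature mean and variance, computed across the batch dimension.
    mean = [sum(x[j] for x in batch) / m for j in range(d)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / m for j in range(d)]
    # Steps 2 and 3: normalize each activation, then scale and shift.
    return [
        [gamma[j] * (x[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
         for j in range(d)]
        for x in batch
    ]

batch = [[1.0, 10.0], [3.0, 30.0]]  # 2 samples, 2 features on different scales
out = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` at 1 and `beta` at 0, both feature columns end up close to -1 and +1 even though their raw scales differ by a factor of ten.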

Layer Normalization

The normalization process involves calculating the mean and variance across the features of each individual sample and then scaling and shifting the features using these statistics.

Compute the Mean and Variance for Each Sample: The mean and variance are calculated over the features of each input, independently of the other samples in the batch. ($H$ is the number of features (neurons) in the layer.)

$\mu = \frac{1}{H}\sum_{i=1}^H x_i \\ \sigma^2 = \frac{1}{H}\sum_{i=1}^{H}(x_i-\mu)^2$

Normalize the Input:

$\hat{x_i} = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$

Apply Scaling and Shifting:

$y_i = \gamma \odot \hat{x_i} + \beta$
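A comparable sketch for Layer Normalization, assuming a single sample given as a list of $H$ features and per-feature `gamma` and `beta` (the function name and inputs are illustrative):

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm for one sample: statistics are taken over its H features."""
    H = len(x)
    mean = sum(x) / H
    var = sum((v - mean) ** 2 for v in x) / H
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

ones, zeros = [1.0] * 3, [0.0] * 3
sample_a = [1.0, 2.0, 3.0]
sample_b = [10.0, 20.0, 30.0]  # same pattern, ten times larger

out_a = layer_norm(sample_a, ones, zeros)
out_b = layer_norm(sample_b, ones, zeros)
```

Each sample is normalized on its own, so `out_a` and `out_b` are nearly identical and neither depends on the batch size. That independence is why this form suits variable-length sequences.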

RMSNorm (Root Mean Square Layer Normalization)

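RMSNorm is a variant of Layer Normalization, proposed by Zhang & Sennrich (2019) in the paper at https://arxiv.org/abs/1909.03341 as a replacement for standard LayerNorm. It is used mainly in deep learning models such as Transformers to improve training stability and efficiency.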

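Core idea

RMSNorm only rescales the input vector; it does not re-center or shift it:

Standard LayerNorm: mean centering + variance normalization + scale/shift, applied to each neuron's output.
RMSNorm: scales by the root mean square (RMS) only, omitting mean centering and the shift.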

Mathematical formulation

(1) Standard LayerNorm

For an input vector $\mathbf{x} \in \mathbb{R}^d$, LayerNorm computes:

$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sigma} + \beta$

$\mu = \frac{1}{d}\sum_{i=1}^d x_i$ (mean)
$\sigma = \sqrt{\frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2}$ (standard deviation)
$\gamma, \beta$ are the learnable scale and shift parameters.

(2) RMSNorm

RMSNorm simplifies the computation and keeps only the root mean square (RMS) scaling:

$\text{RMSNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x}}{\text{RMS}(\mathbf{x})}$

$\text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}$ (root mean square)
$\gamma$ is the learnable scale parameter (there is no shift $\beta$).
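The RMSNorm formula can be sketched as follows. The `eps` term inside the square root is an assumption added for numerical safety; it does not appear in the formula above. The function name and inputs are illustrative.

```python
import math

def rms_norm(x, gamma, eps=1e-8):
    """Scale x by its root mean square; no mean subtraction, no shift."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)  # eps is an added assumption
    return [g * v / rms for v, g in zip(x, gamma)]

x = [3.0, 4.0]
y = rms_norm(x, gamma=[1.0, 1.0])

# After normalization the RMS of the output is (almost exactly) 1,
# but the mean is not forced to zero.
y_rms = math.sqrt(sum(v * v for v in y) / len(y))
y_mean = sum(y) / len(y)
```

Compared with a LayerNorm sketch, there is one fewer pass over the data (no mean) and one fewer parameter vector (no `beta`).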

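Why is RMSNorm more efficient?

Property	LayerNorm	RMSNorm
Computational cost	Higher (needs mean and variance)	Lower (needs only the RMS)
Parameter count	2d ($\gamma, \beta$)	d ($\gamma$)
Training stability	Relies on mean centering	More robust to initialization
Typical use	Standard Transformer	Large models (e.g. LLaMA)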
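
The saving in the table comes from skipping the mean. A short worked identity, using only the definitions of $\mu$, $\sigma$ and $\text{RMS}$ given earlier, shows how the two denominators relate:

$\text{RMS}(\mathbf{x})^2 = \frac{1}{d}\sum_{i=1}^d x_i^2 = \frac{1}{d}\sum_{i=1}^d (x_i - \mu)^2 + \mu^2 = \sigma^2 + \mu^2$

The cross term vanishes because $\sum_{i=1}^d (x_i - \mu) = 0$. So when the input already has zero mean, $\text{RMS}(\mathbf{x}) = \sigma$ and RMSNorm gives the same result as LayerNorm with $\beta = 0$. For inputs with a nonzero mean the two differ.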

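Advantages:

Reduces normalization computation by roughly 25%~50% (most noticeable in deep networks).
Omits the shift parameter $\beta$, which reduces the parameter count and may lower the risk of overfitting.
Performs well in large language models (LLMs); for example, the LLaMA family uses RMSNorm throughout.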

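Experimental comparison with LayerNorm

Results: on tasks such as machine translation and language modeling, RMSNorm performs on par with LayerNorm while being faster to compute.
Paper results (Zhang & Sennrich, 2019):
- Running time reduced by 7%~64%, depending on the model.
- Better stability in deep Transformers.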

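Use cases

Large language models (LLMs):
- Models such as LLaMA use RMSNorm.
Lightweight models:
- Settings that are sensitive to compute resources (e.g. mobile and edge devices).
Replacing LayerNorm:
- Can be swapped in directly where efficient normalization is needed.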

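Caveats for Normalization

Not a cure-all:
• Normalization can hurt performance on some tasks (e.g. by constraining the output distribution in generative models).
• At inference time the statistics must be fixed (e.g. Batch Norm uses the global mean and variance accumulated during training).
Works together with initialization:
Normalization is often combined with Xavier/Kaiming initialization so that the two jointly control the scale of signal propagation.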
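
To illustrate the caveat about fixed statistics at inference, here is a rough single-feature sketch. The class name, the `momentum` value and the exponential-moving-average update are assumptions modelled on common framework conventions, not something these notes specify.

```python
import math

class RunningBatchNorm1D:
    """Single-feature batch norm that tracks running statistics (hypothetical helper)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = 0.0
        self.running_var = 1.0

    def train_step(self, batch):
        m = len(batch)
        mean = sum(batch) / m
        var = sum((x - mean) ** 2 for x in batch) / m
        # Exponential moving average of the per-batch statistics.
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        # During training, normalize with the current batch statistics.
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

    def infer(self, x):
        # At inference, use the fixed running statistics so a single input
        # gets the same result regardless of what else is in the batch.
        return (x - self.running_mean) / math.sqrt(self.running_var + self.eps)

bn = RunningBatchNorm1D()
for _ in range(200):
    bn.train_step([4.0, 6.0])  # every batch has mean 5 and variance 1

single = bn.infer(5.0)
```

After enough training steps `running_mean` settles near 5, so `bn.infer(5.0)` is close to 0 even though only one value is passed in.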