Regularization (正则化)

Why

  • Primary goal: prevent overfitting by explicitly or implicitly constraining model complexity, improving generalization.
  • Secondary goal: may also speed up training (e.g., L2 stabilizes optimization).

In mathematics, statistics, finance,[1] and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer to a problem into a simpler one. It is often used in solving ill-posed problems or to prevent overfitting.[2]

What

Explicit Regularization

Explicit regularization directly modifies the loss function or the model structure.

Example

  • L1/L2 regularization: add a penalty term on the parameters to the loss function.
  • Elastic net regularization: combines the L1 and L2 penalties.
  • Dropout: during training, randomly "drop" a subset of neurons (and their connections) by setting their outputs to zero, sampling a different subset at each iteration.

Implicit Regularization

Implicit regularization constrains the model indirectly through the training strategy.

Example

  • Early stopping (a minimal sketch follows this list)
  • Data augmentation
  • Weight decay
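
A minimal sketch of early stopping: monitor the validation loss and halt once it stops improving. `train_one_epoch` and `evaluate` are hypothetical placeholders standing in for a real training step and validation pass:

```python
def fit_with_early_stopping(model, train_data, val_data,
                            max_epochs=100, patience=5):
    best_val_loss = float("inf")
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)    # hypothetical: update weights
        val_loss = evaluate(model, val_data)  # hypothetical: validation loss
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_since_improvement = 0      # still generalizing better
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                         # validation loss has plateaued
    return model
```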

How

L1 and L2 Regularization

  • L1 regularization (also called LASSO) leads to sparse models by adding a penalty based on the absolute values of the coefficients.

$$Cost = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\sum_{i=1}^{m} |w_i|$$

  • L2 regularization (also called ridge regression) encourages smaller, more evenly distributed weights by adding a penalty based on the squares of the coefficients.[4]

$$Cost = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\sum_{i=1}^{m} w_i^2$$
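
As a quick illustration, a minimal NumPy sketch of both penalized costs, following the two formulas above (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def l1_cost(y, y_hat, w, lam):
    # Mean squared error plus lambda * sum(|w_i|)  (LASSO)
    return np.mean((y - y_hat) ** 2) + lam * np.sum(np.abs(w))

def l2_cost(y, y_hat, w, lam):
    # Mean squared error plus lambda * sum(w_i^2)  (ridge)
    return np.mean((y - y_hat) ** 2) + lam * np.sum(w ** 2)

# Example: the L1 penalty grows linearly with the weights,
# the L2 penalty quadratically.
y, y_hat = np.array([1.0, 2.0]), np.array([1.1, 1.9])
w = np.array([0.5, -0.3, 0.0])
print(l1_cost(y, y_hat, w, lam=0.1), l2_cost(y, y_hat, w, lam=0.1))
```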

Elastic net regularization

Elastic net regression combines both the L1 and L2 penalties, weighted by a mixing parameter α.

$$Cost = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda\left((1-\alpha)\sum_{i=1}^{m} |w_i| + \alpha\sum_{i=1}^{m} w_i^2\right)$$
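
A matching NumPy sketch of this cost, with α applied exactly as in the formula above. (Note that libraries parameterize the mix differently; for instance, scikit-learn's `ElasticNet` uses an `l1_ratio` that weighs the L1 term, the reverse of this α.)

```python
import numpy as np

def elastic_net_cost(y, y_hat, w, lam, alpha):
    # (1 - alpha) weighs the L1 term and alpha weighs the L2 term,
    # matching the formula above.
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return np.mean((y - y_hat) ** 2) + lam * ((1 - alpha) * l1 + alpha * l2)
```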

Dropout

Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is an efficient way of performing model averaging with neural networks.[3] It works by randomly "dropping out" (omitting) units (both hidden and visible) during training. (DropConnect is a related variant that drops individual weights rather than whole units.)

$$\hat{w}_j = \begin{cases} w_j & \text{with probability } c \\ 0 & \text{otherwise} \end{cases}$$
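
A minimal sketch of "inverted" dropout, the common formulation in which surviving activations are rescaled by the keep probability (playing the role of c above) so nothing changes at test time:

```python
import numpy as np

def dropout(x, keep_prob=0.8, training=True, seed=None):
    # Zero each unit independently with probability 1 - keep_prob and
    # rescale survivors by 1 / keep_prob so the expected activation is
    # unchanged; at test time the input passes through untouched.
    if not training:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

# Example: roughly 20% of activations are zeroed during training.
h = np.ones((2, 5))
print(dropout(h, keep_prob=0.8, seed=0))
```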

The Difference from Normalization

  • Regularization is like traffic rules: it forces the model to follow simple rules (e.g., speed limits) and avoid dangerous behavior (overfitting).
  • Normalization is like vehicle maintenance: it keeps the engine (gradients) running smoothly, but does not care whether the driver speeds (overfits).

Tuning priority: first make training stable (normalization), then control overfitting (regularization).

References

https://en.wikipedia.org/wiki/Regularization_(mathematics)
https://www.geeksforgeeks.org/regularization-in-machine-learning/
https://en.wikipedia.org/wiki/Dilution_(neural_networks)
[1] Kratsios, Anastasis (2020). "Deep Arbitrage-Free Learning in a Generalized HJM Framework via Arbitrage-Regularization". Risks. 8 (2): 40. doi:10.3390/risks8020040. hdl:20.500.11850/456375. "Term structure models can be regularized to remove arbitrage opportunities."
[2] Bühlmann, Peter; Van De Geer, Sara (2011). Statistics for High-Dimensional Data. Springer Series in Statistics. p. 9. doi:10.1007/978-3-642-20192-9. ISBN 978-3-642-20191-2. If p > n, the ordinary least squares estimator is not unique and will heavily overfit the data. Thus, a form of complexity regularization will be necessary.
[3] Hinton, Geoffrey E.; Srivastava, Nitish; Krizhevsky, Alex; Sutskever, Ilya; Salakhutdinov, Ruslan R. (2012). “Improving neural networks by preventing co-adaptation of feature detectors”. arXiv:1207.0580 [cs.NE].