Problem Set: Probability and Statistics

Exisfar | 2025-05-11 | 2025-05-14

The two schools of statistical inference: frequentist and Bayesian

Statistics has two main branches: descriptive statistics and statistical inference.

The frequentist views probability as the long-run frequency of events, making inferences based on sample data. In contrast, the Bayesian treats probability as a measure of subjective belief, incorporating prior knowledge into its inferences. The key difference lies in their definitions and interpretations of probability, yet the two approaches can also complement each other.

Frequentist: Parameters are fixed but unknown constants, estimated from data (e.g., the sample mean estimates the population mean).
Bayesian: Parameters are random variables with prior distributions (e.g., a prior belief that a coin's probability of heads is around 0.6), which the data update into posterior distributions.
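
As a worked illustration of the Bayesian update, consider the coin example with a conjugate prior. The Beta prior and the numbers below are assumptions chosen for this sketch, not something fixed by the text above. Let $\theta$ be the probability of heads with prior $\text{Beta}(6, 4)$, whose mean is $0.6$. After observing $k$ heads in $m$ flips, Bayes' rule gives

$p(\theta \mid x) \propto \theta^{k}(1-\theta)^{m-k} \cdot \theta^{5}(1-\theta)^{3} = \theta^{k+5}(1-\theta)^{m-k+3}$

which is the kernel of a $\text{Beta}(6 + k, 4 + m - k)$ distribution. For instance, with $k = 3$ heads in $m = 10$ flips the posterior is $\text{Beta}(9, 11)$, with mean $9/20 = 0.45$: it sits between the prior mean $0.6$ and the observed frequency $0.3$.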

Why is the posterior distribution useful (compared with the frequentist MLE)?

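A fuller picture of the parameter: you learn not only the most likely value but also its uncertainty and the shape of its distribution.
More robust predictions: predictions average over all plausible parameter values instead of relying on a single point estimate.
More direct decision support: you can compute probabilities, compare hypotheses, and optimize actions.
More stable with small samples: prior information guards against overfitting.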
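
The points above can be sketched with a tiny coin-flip example. This is a minimal illustration, not a definitive implementation: the function name `grid_posterior`, the Beta(2, 2) prior, the grid size, and the data (3 heads in 3 flips) are all assumptions made for the example. The posterior is approximated on a grid so that only the standard library is needed.

```python
def grid_posterior(heads, tails, prior_a=2.0, prior_b=2.0, n_grid=1001):
    """Approximate the posterior over a coin's heads probability on a grid.

    Hypothetical helper: assumes a Beta(prior_a, prior_b) prior and
    independent Bernoulli flips.
    """
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    # Unnormalized posterior = prior kernel * likelihood.
    weights = [
        t ** (prior_a - 1 + heads) * (1 - t) ** (prior_b - 1 + tails)
        for t in grid
    ]
    total = sum(weights)
    return grid, [w / total for w in weights]


heads, tails = 3, 0  # a very small sample (assumed data)

# Frequentist point estimate: observed fraction of heads.
mle = heads / (heads + tails)

grid, post = grid_posterior(heads, tails)

# Posterior mean; for a coin this is also the predictive
# probability that the next flip is heads.
post_mean = sum(t * p for t, p in zip(grid, post))

# A direct probability statement about the parameter.
prob_biased = sum(p for t, p in zip(grid, post) if t > 0.5)

print(f"MLE = {mle:.3f}")
print(f"posterior mean = {post_mean:.3f}")
print(f"P(theta > 0.5 | data) = {prob_biased:.3f}")
```

With three heads in a row the MLE is exactly 1, which predicts that tails can never occur. Under the assumed prior the exact posterior is Beta(5, 2) with mean 5/7, so the prediction stays away from that extreme, and the posterior also supports statements such as the probability that the coin favors heads.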

Maximum Likelihood Estimation (MLE)

Why

To estimate the parameters $\theta$ of a statistical model $P_G(x; \theta)$ so that it best matches the true data distribution $P_{data}(x)$ .

What

MLE selects the parameter $\theta$ that maximizes the likelihood function $\mathcal{L}(\theta)$ , which measures how probable the observed data is under the model.
For independent samples $\{x^1, x^2, ..., x^m\}$ , the likelihood is:

$\mathcal{L}(\theta) = \prod_{i=1}^m P_G(x^i; \theta)$
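
To make the definition concrete, here is a worked case under an assumed Bernoulli model, $P_G(x; \theta) = \theta^{x}(1-\theta)^{1-x}$ with $x \in \{0, 1\}$. The model choice is only an illustration. Writing $k = \sum_{i=1}^m x^i$ for the number of ones, taking the logarithm turns the product into a sum:

$\log \mathcal{L}(\theta) = \sum_{i=1}^m \log P_G(x^i; \theta) = k \log \theta + (m - k) \log (1 - \theta)$

Setting the derivative to zero gives the maximizer:

$\frac{d}{d\theta} \log \mathcal{L}(\theta) = \frac{k}{\theta} - \frac{m - k}{1 - \theta} = 0 \;\Rightarrow\; \theta^* = \frac{k}{m}$

So in this case the maximum-likelihood estimate is simply the observed fraction of ones.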

How

Step-by-Step Process:

Assume a model: Choose a parametric distribution $P_G(x; \theta)$ (e.g., Gaussian, Bernoulli).
Collect data: Observe samples $\{x^1, ..., x^m\}$ from $P_{data}(x)$ .
Write the likelihood: Compute $\mathcal{L}(\theta)$ or $\log \mathcal{L}(\theta)$ .
Optimize: Find $\theta^*$ that maximizes the (log-)likelihood:

$\theta^* = \arg \max_\theta \log \mathcal{L}(\theta)$

For simple models (e.g., Gaussian mean), this can be solved analytically.
For complex models (e.g., neural networks), gradient-based optimization (e.g., SGD) is used.
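
The two optimization routes can be sketched for a Gaussian model. This is a minimal illustration under stated assumptions: the function names, the sample values, the learning rate, and the step count are all made up for the example, and the iterative version uses plain full-batch gradient ascent rather than SGD to keep it short.

```python
import math


def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. samples under N(mu, sigma^2)."""
    m = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * m * math.log(2 * math.pi * sigma ** 2) - sq / (2 * sigma ** 2)


def gaussian_mle(data):
    """Closed-form MLE: sample mean and the divide-by-m standard deviation."""
    m = len(data)
    mu = sum(data) / m
    var = sum((x - mu) ** 2 for x in data) / m
    return mu, math.sqrt(var)


def gradient_ascent_mean(data, sigma=1.0, lr=0.1, steps=200):
    """Maximize the log-likelihood over mu by full-batch gradient ascent."""
    mu = 0.0
    m = len(data)
    for _ in range(steps):
        # d/dmu log L = sum(x - mu) / sigma^2; dividing by m averages over samples.
        grad = sum(x - mu for x in data) / (sigma ** 2 * m)
        mu += lr * grad
    return mu


samples = [2.1, 1.9, 2.4, 2.0, 1.6]  # assumed observations
mu_hat, sigma_hat = gaussian_mle(samples)
mu_gd = gradient_ascent_mean(samples)

print(f"analytic:        mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f"gradient ascent: mu = {mu_gd:.4f}")
```

Both routes should agree on the mean here, since the log-likelihood is concave in `mu`. For models without a closed form, only the iterative route is available.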

Parametric and nonparametric approaches

Parametric approach:
- Def: use probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set.
- Limitation: the chosen density might be a poor model of the distribution that generates the data, which can result in poor predictive performance.
Nonparametric approach:
- Def: make few assumptions about the form of the distribution.
- Example: Histogram methods, kernel density estimator.
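
The two nonparametric examples can be sketched in a few lines. This is a rough illustration only: the function names `histogram_density` and `kde_gaussian`, the data, the bin layout, and the bandwidth are assumptions chosen for the example, and the Gaussian kernel is just one possible kernel choice.

```python
import math


def histogram_density(data, x, lo, hi, bins):
    """Histogram density estimate at x with equal-width bins on [lo, hi)."""
    if not (lo <= x < hi):
        return 0.0
    width = (hi - lo) / bins
    k = int((x - lo) / width)  # index of the bin containing x
    count = sum(
        1 for v in data if lo <= v < hi and int((v - lo) / width) == k
    )
    # Fraction of points in the bin, divided by the bin width.
    return count / (len(data) * width)


def kde_gaussian(data, x, bandwidth):
    """Kernel density estimate at x using a Gaussian kernel."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data) / norm


data = [1.0, 1.2, 1.9, 2.1, 2.3, 3.8]  # assumed observations

print(histogram_density(data, 2.5, lo=0.0, hi=4.0, bins=4))
print(kde_gaussian(data, 2.5, bandwidth=0.5))
```

Neither estimator assumes a functional form for the density. The bin width and the bandwidth play the role of smoothing parameters, and they still have to be chosen.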