Problem Set: Probability and Statistics

The two schools of statistical inference: frequentist and Bayesian

Statistics has two main branches: descriptive statistics and statistical inference.

The frequentist views probability as the long-run frequency of events, making inferences based on sample data. In contrast, the Bayesian treats probability as a measure of subjective belief, incorporating prior knowledge into its inferences. The key difference lies in their definitions and interpretations of probability, yet the two approaches can also complement each other.

  • Frequentist: Parameters are fixed unknown constants estimated from data (e.g., sample mean estimates population mean).
  • Bayesian: Parameters are random variables with prior distributions (e.g., a prior belief that a coin lands heads with probability near 0.6), which are updated to posterior distributions using data.
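
To make the prior-to-posterior update concrete, here is a minimal sketch using the coin example. It assumes a Beta prior on the heads probability, which is conjugate to coin-flip data, so the update reduces to adding counts. The prior Beta(6, 4) (mean 0.6), the observed counts, and the helper names are illustrative assumptions, not part of the original notes.

```python
from fractions import Fraction

def beta_bernoulli_update(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing coin flips (conjugate update)."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)

# Assumed prior: Beta(6, 4), whose mean is 0.6 (a coin believed to favour heads).
prior = (6, 4)

# Hypothetical data: 3 heads and 7 tails.
heads, tails = 3, 7

posterior = beta_bernoulli_update(*prior, heads, tails)

# Frequentist point estimate for comparison: the sample proportion of heads.
mle = Fraction(heads, heads + tails)

print("prior mean:    ", float(beta_mean(*prior)))
print("MLE:           ", float(mle))
print("posterior:      Beta", posterior)
print("posterior mean:", float(beta_mean(*posterior)))
```

The posterior mean lands between the prior mean and the sample proportion, and it moves toward the sample proportion as more flips are observed.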

Why is the posterior distribution useful (compared with the frequentist MLE)?

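  • A fuller picture of the parameter: it gives not only the most probable value but also its uncertainty and the shape of its distribution.
  • More robust predictions: they average over all plausible parameter values instead of relying on a single point estimate.
  • More direct decision support: probabilities can be computed, hypotheses compared, and actions optimized.
  • More stable with small samples: prior information guards against overfitting.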
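
A small-sample sketch of these points, under stated assumptions: a coin is flipped twice and lands heads both times, and the prior is Beta(2, 2). The data, the prior, and the helper names are hypothetical; the integration is a plain midpoint rule so that the example needs no external libraries.

```python
def beta_kernel(theta, alpha, beta):
    """Unnormalized Beta(alpha, beta) density at theta."""
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1)

def integrate(f, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

# Hypothetical tiny data set: two flips, both heads.
heads, tails = 2, 0

# Assumed weak prior Beta(2, 2), centred on a fair coin.
alpha, beta = 2 + heads, 2 + tails  # posterior is Beta(4, 2)

# The MLE is the raw proportion, which claims the coin always lands heads.
mle = heads / (heads + tails)

# The posterior mean is also the predictive probability that the next flip is heads.
posterior_mean = alpha / (alpha + beta)

# A direct probability statement about a hypothesis: P(theta > 0.5 | data).
def density(t):
    return beta_kernel(t, alpha, beta)

evidence = integrate(density, 0.0, 1.0)
prob_heads_favoured = integrate(density, 0.5, 1.0) / evidence

print("MLE:", mle)
print("posterior mean:", round(posterior_mean, 4))
print("P(theta > 0.5 | data):", round(prob_heads_favoured, 4))
```

With only two observations the MLE jumps to an extreme value, while the posterior stays moderate and still quantifies how strongly the data favour a heads-biased coin.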

Maximum Likelihood Estimation (MLE)

Why

To estimate the parameters $\theta$ of a statistical model $P_G(x; \theta)$ so that it best matches the true data distribution $P_{data}(x)$.

What

  • MLE selects the parameter $\theta$ that maximizes the likelihood function $\mathcal{L}(\theta)$, which measures how probable the observed data is under the model.
  • For independent samples $\{x^1, x^2, \ldots, x^m\}$, the likelihood is:

$$\mathcal{L}(\theta) = \prod_{i=1}^m P_G(x^i; \theta)$$
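
In practice the logarithm of the likelihood is usually maximized instead: the logarithm is monotonic, so the maximizer is unchanged, and the product becomes a sum. The worked case below assumes a Bernoulli (coin-flip) model, where each sample is 0 or 1 and $k$ counts the ones; the model choice and the symbol $k$ are illustrative assumptions.

```latex
% Taking the log turns the product over samples into a sum.
\log \mathcal{L}(\theta) = \sum_{i=1}^m \log P_G(x^i; \theta)

% Bernoulli model: P_G(x; \theta) = \theta^{x} (1 - \theta)^{1 - x}, with k = \sum_{i=1}^m x^i.
\log \mathcal{L}(\theta) = k \log \theta + (m - k) \log (1 - \theta)

% Setting the derivative to zero gives the maximizer.
\frac{d}{d\theta} \log \mathcal{L}(\theta) = \frac{k}{\theta} - \frac{m - k}{1 - \theta} = 0
\quad \Longrightarrow \quad \theta = \frac{k}{m}
```

For this model the maximum likelihood estimate is simply the observed fraction of ones.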

How

Step-by-Step Process:

  1. Assume a model: Choose a parametric distribution $P_G(x; \theta)$ (e.g., Gaussian, Bernoulli).
  2. Collect data: Observe samples $\{x^1, \ldots, x^m\}$ from $P_{data}(x)$.
  3. Write the likelihood: Compute $\mathcal{L}(\theta)$ or $\log \mathcal{L}(\theta)$.
  4. Optimize: Find $\theta^*$ that maximizes the (log-)likelihood:

$$\theta^* = \arg \max_\theta \log \mathcal{L}(\theta)$$

  • For simple models (e.g., Gaussian mean), this can be solved analytically.
  • For complex models (e.g., neural networks), gradient-based optimization (e.g., SGD) is used.
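
The four steps can be sketched for a Gaussian model as follows. The data are synthetic, and the true parameters, learning rate, and iteration count are arbitrary illustrative choices. The closed-form estimates are compared against plain full-batch gradient ascent on the mean (not SGD, which would use random mini-batches), with the variance held at its closed-form value for simplicity.

```python
import random

# Steps 1-2: assume a Gaussian model and draw synthetic data (hypothetical parameters).
random.seed(0)
TRUE_MU, TRUE_SIGMA = 2.0, 1.5
data = [random.gauss(TRUE_MU, TRUE_SIGMA) for _ in range(1000)]
m = len(data)

# Steps 3-4, analytic route: the Gaussian log-likelihood is maximized by the
# sample mean and by the variance that divides by m (not m - 1).
mu_closed = sum(data) / m
var_closed = sum((x - mu_closed) ** 2 for x in data) / m

def grad_avg_loglik_mu(mu, sigma2):
    """Gradient of the average Gaussian log-likelihood with respect to mu."""
    return sum(x - mu for x in data) / (m * sigma2)

# Steps 3-4, numerical route: full-batch gradient ascent on mu.
mu = 0.0
LEARNING_RATE = 0.5
for _ in range(200):
    mu += LEARNING_RATE * grad_avg_loglik_mu(mu, var_closed)

print("closed-form mean:    ", round(mu_closed, 4))
print("gradient-ascent mean:", round(mu, 4))
print("closed-form variance:", round(var_closed, 4))
```

Both routes agree here because the Gaussian log-likelihood is concave in the mean; for models without a closed form, only the iterative route is available.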

Parametric and nonparametric methods

  • Parametric approach:
    • Def: use probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set.
    • Limitation: the chosen density might be a poor model of the distribution that generates the data, which can result in poor predictive performance.
  • Nonparametric approach:
    • Def: make few assumptions about the form of the distribution.
    • Example: Histogram methods, kernel density estimator.
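
The two nonparametric examples can be sketched in a few lines of plain Python. The bimodal synthetic data, the bin count, and the bandwidth are arbitrary illustrative choices, and `histogram_density` and `kde_gaussian` are hypothetical helper names rather than library functions.

```python
import math
import random

# Hypothetical bimodal sample that a single Gaussian would model poorly.
random.seed(1)
data = ([random.gauss(-2.0, 0.5) for _ in range(300)]
        + [random.gauss(1.0, 1.0) for _ in range(700)])

def histogram_density(samples, lo, hi, bins):
    """Piecewise-constant density estimate on [lo, hi) with equal-width bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        if lo <= x < hi:
            idx = min(int((x - lo) / width), bins - 1)  # guard against float rounding
            counts[idx] += 1
    return [c / (len(samples) * width) for c in counts]

def kde_gaussian(samples, x, bandwidth):
    """Kernel density estimate at x: average of Gaussian bumps centred on the samples."""
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return scale * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

LO, HI, BINS = -6.0, 6.0, 24
hist = histogram_density(data, LO, HI, BINS)
bin_width = (HI - LO) / BINS
hist_mass = sum(h * bin_width for h in hist)  # total probability covered by the bins

BANDWIDTH = 0.3
N_GRID = 320
STEP = 16.0 / N_GRID
grid = [-8.0 + (i + 0.5) * STEP for i in range(N_GRID)]
kde_mass = sum(kde_gaussian(data, x, BANDWIDTH) * STEP for x in grid)

print("histogram mass on [-6, 6):", round(hist_mass, 4))
print("KDE mass on [-8, 8]:      ", round(kde_mass, 4))
print("KDE near the two modes:   ", round(kde_gaussian(data, -2.0, BANDWIDTH), 4),
      round(kde_gaussian(data, 1.0, BANDWIDTH), 4))
```

Neither estimator assumes a functional form for the distribution, but each has a smoothing choice (bin width or bandwidth) that controls how closely the estimate follows the individual samples.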