Assumptions of Linear Regression

Linear Regression

A linear model typically takes the form \(Y=f\left(X_{1}, X_{2}, X_{3}\right)+\varepsilon\). The predictors themselves do not need to enter linearly; a nonlinear relationship can be turned into a linear one through suitable transformations. In matrix form the linear model is \(y=X \beta+\varepsilon\), where y is an n×1 vector. The question is then how to estimate β so that X explains as much of y as possible. Throughout this note, p denotes the number of parameters in β.

We write \(\hat{y}=X \hat{\beta}\) for the predicted (fitted) values and \(\hat{\varepsilon} = y-\hat y\) for the residuals. The response y is n-dimensional, the systematic part accounts for p dimensions, and the remaining n−p dimensions are unexplained random variation (this can be shown with the QR decomposition). Quantities without a hat denote true values.

Least Squares (LS) Estimation

\[ X^{\top} X \hat{\beta}=X^{\top} y \]

This is the normal equation.
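
As a quick illustration, here is a minimal numerical sketch (with simulated data and illustrative names, not anything from these notes) of solving the normal equation with numpy:

```python
import numpy as np

# Simulated data: n observations, p parameters (including an intercept column).
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Solve X^T X beta_hat = X^T y directly ...
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# ... or use the numerically more stable SVD-based solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
```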

There are in fact many ways to estimate β, so why use least squares? There are three reasons:

  1. Its result is exactly the orthogonal projection of y onto the model space, which matches our geometric intuition.
  2. If the errors are independent, identically distributed, and normal, the least squares estimate of β coincides with the MLE (maximum likelihood estimate); this can be derived from the MLE formula. Roughly speaking, the MLE maximizes the probability of the observed data.
  3. The Gauss-Markov theorem guarantees that the least squares estimate \(\hat \beta\) is the best linear unbiased estimator (BLUE). A short mathematical derivation is given below.

Gauss-Markov Theorem

If the errors in the linear regression model are uncorrelated, have equal variance, and have expectation zero:

Then the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators.

Note: The errors do not need to be normal, nor do they need to be independent and identically distributed (only uncorrelated with mean zero and homoscedastic with finite variance). The requirement that the estimator be unbiased cannot be dropped, since biased estimators exist with lower variance. See, for example, the James–Stein estimator (which also drops linearity), ridge regression, or simply any degenerate estimator.

Gauss-Markov Theorem

Assumptions: \(E \varepsilon=0\) and \(\operatorname{Var} \varepsilon=\sigma^{2} I\).

The normal equation above defines the LS estimate of β. At this stage we do not know how good this estimate is. The following results address that; their proofs can be found here.

First, we define the notion of an estimable function:

Estimable function: if there exists a vector a such that \(E a^{\top} y=q^{\top} \beta\), where q is known, then \(q^\top\beta\) is called an estimable function.

Theorem 1: \(q^\top \beta\) is estimable, if and only if there exists a vector t such that \(q^\top=t^\top X\).

Proof: see P3.

Theorem 2: \(q^\top \beta\) is estimable, if and only if \(X^{\top} X u=q\) has a solution for u.

Proof: see P3. Note: to decide whether a non-homogeneous linear system has a solution, compare the rank of the coefficient matrix with that of the augmented matrix; if the determinant of the coefficient matrix is zero, the system may have no solution or infinitely many solutions.

Theorem 3: If \(q^\top \beta\) is estimable, then \(q^\top \hat\beta\) is an unbiased estimator of \(q^\top\beta\), and \(q^\top\hat\beta\) is invariant to which solution \(\hat\beta\) of \(X^{\top} X \hat\beta=X^\top y\) is used.

In short, if \(q^\top \beta\) is estimable, we only need to find any solution \(\hat \beta\) of \(X^{\top} X \hat\beta=X^\top y\); then \(q^\top \hat\beta\) is an unbiased estimator of \(q^\top\beta\).

Proof: see P4.

Theorem 4: Of all unbiased linear estimators of an estimable function \(q^\top\beta\), \(q^\top\hat\beta\) has the smallest variance.

In short, among all linear unbiased estimators of \(q^\top \beta\), \(q^\top \hat \beta\) has the smallest variance; it is the most stable and therefore the best estimate.

Proof: see P5.

In summary, \(\hat \beta\) is the BLUE (best linear unbiased estimator) of \(\beta\).

Assumptions of the Classical Linear Regression Model

  1. [Linearity] The model is linear in the parameters.
  2. [Weak exogeneity] \(E[\varepsilon \mid X]=0\); this implies that X and ε are uncorrelated and that the errors have mean zero.
  3. [Lack of perfect multicollinearity] The columns of X are linearly independent: \(\operatorname{Pr}(\operatorname{rank}(X)=p)=1\). If this assumption is violated, the columns of X are linearly dependent (perfectly multicollinear).
  4. [Constant variance, independence] \(\operatorname{Var}(\varepsilon \mid X)=\sigma^{2} I_{n}\), which means
    • homoscedasticity: \(E\left(\varepsilon_{i}^{2} \mid X\right)=\sigma^{2}\)
    • no autocorrelation, i.e., errors from different observations are uncorrelated: \(E\left[\varepsilon_{i} \varepsilon_{j} \mid X\right]=0\) for \(i\ne j\)
  5. Some treatments add the assumption that the observations (x, y) are i.i.d., meaning that the sample is drawn by random sampling.

Some Basic Properties

  1. projection matrix:\(H=X\left(X^{\top} X\right)^{-1} X^{\top}\)

    • \(H^\top = H\)
  2. fitted values: \(\hat{y}=X \hat{\beta}=H y\)

    • We can interpret this as projecting y onto the column space of X, so that the error lies in the space orthogonal to X.
  3. residuals: \(\hat{\varepsilon}=y-\hat{y}=(I-H) y\)

  4. Under the assumptions above, it follows that (a numerical sketch follows this list):

    • \(\hat{\beta} \sim N\left(\beta, \sigma^{2}\left(X^{\top} X\right)^{-1}\right)\)

      • \(\operatorname{Var}(\hat{\beta})=\left(X^{\top} X\right)^{-1} \sigma^{2}\); proof: see STATS 500 ch2 slides, P28

      • \(\operatorname{Var}\left(\hat{\beta}_{j}\right)=\left(X^{\top} X\right)_{j j}^{-1} \sigma^{2}\), where σ can be estimated with the unbiased estimator below.

    • \(\hat{\sigma}^{2} \sim \frac{\sigma^{2}}{n-p} \cdot \chi_{n-p}^{2}\)

      • \(\hat{\sigma}^2=\frac{\sum\left(y_{i}-\hat{y}_{i}\right)^{2}}{n-p}\) is an unbiased estimator of \(\sigma^2\); here n−p is the degrees of freedom and \(E(\hat\sigma^2)=\sigma^2\). The proof is the same as showing \(E\left(\hat{\varepsilon}^{\top} \hat{\varepsilon}\right)=E(\mathrm{RSS})=(n-p) \sigma^{2}\); see STATS 500 homework 3, problem 2.
  5. \(Adj.\ R^{2}=1-\frac{RSS /(n-p-1)}{TSS /(n-1)}\)
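
A minimal sketch, assuming simulated data, that computes the quantities listed above (hat matrix, fitted values, residuals, \(\hat\sigma^2\), and the standard errors of \(\hat\beta\)):

```python
import numpy as np

# Simulated design matrix (with intercept) and response.
rng = np.random.default_rng(0)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                 # projection (hat) matrix; H = H.T and H @ H = H
y_hat = H @ y                         # fitted values
resid = y - y_hat                     # residuals, (I - H) y
sigma2_hat = resid @ resid / (n - p)  # RSS / (n - p), unbiased for sigma^2
se_beta = np.sqrt(np.diag(XtX_inv) * sigma2_hat)   # SE(beta_hat_j)
print(sigma2_hat, se_beta)
```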

Hypothesis Tests and Distributions

Hypothesis testing: the p-value is the probability, assuming the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed.

The rare-event principle: an event with small probability is very unlikely to occur in a single trial. If H0 were true, the sample falling into the rejection region W would be a small-probability event and should not happen; if it does happen, we reject the null hypothesis.

If we only want to estimate β, we do not need to assume a distribution for the errors; but once we carry out hypothesis tests, the distribution becomes important.

Chi-squared Distribution and Chi-squared Test

Suppose \(X_{1}, \ldots, X_{n} \sim N(0,1)\) are mutually independent; then \[ \chi^{2}=\sum_{i=1}^{n} X_{i}^{2} \] follows a chi-squared distribution with n degrees of freedom.

t Distribution and t Test

Suppose \(X \sim N(0,1)\) and \(Y \sim \chi^{2}(n)\) are independent; then \[ T=\frac{X}{\sqrt{Y/ n}} \] follows a t distribution with n degrees of freedom.

F Distribution and F Test

Suppose \(X\sim \chi^{2}(n_1)\) and \(Y\sim \chi^{2}(n_2)\) are independent; then \[ F=\frac{X / n_{1}}{Y / n_{2}} \] follows an F distribution with \(n_1\) and \(n_2\) degrees of freedom.

Inference and Diagnostics

In the case where the residuals follow a normal distribution:

Significance Tests for the Coefficients

From the discussion above we know that \(\hat\beta\) follows a normal distribution, but we do not know its mean and variance and can only estimate them from the sample, so the following tests are used:

  1. Testing whether a single regression coefficient is significant (\(H_0:\beta_i=0\)): t test
    • \(T=\frac{\hat{\beta_{i}}-0}{SE_{\hat{\beta}_{i}}}\); the 0 may be replaced by any other hypothesized value
    • \(t_i^2=F_\omega\), i.e., the squared t statistic equals the F statistic for dropping that single predictor
    • Confidence interval: \(\hat{\beta}_i \pm t_{n-p}^{(\alpha / 2)} \operatorname{se}(\hat{\beta}_i)\)
  2. Testing whether the coefficients as a whole (the overall model) are significant (\(H_0:\beta_1=\dots=\beta_p=0\)): F test (ANOVA)
    1. \(F=\frac{\left(\mathrm{RSS}_\omega-\mathrm{RSS}_{\Omega}\right) /(p-q)}{\operatorname{RSS}_{\Omega} /(n-p)} \sim F_{p-q, n-p}\), where ω denotes the model under the null hypothesis and Ω the model under the alternative, because \(\frac{RSS}{\sigma^2}\sim \chi^2_{n-p}\).
    2. \(F=\frac{(TSS-RSS)/ (p-1)}{\operatorname{RSS} /(n-p)}\)
    3. A large statistic means the null hypothesis can be rejected.
    4. Joint confidence region for several coefficients: \((\hat{\beta}-\beta)^T X^T X(\hat{\beta}-\beta) \leq p \hat{\sigma}^2 F_{p, n-p}^{(\alpha)}\)
  3. Permutation test: allows a test statistic to be assessed without the normality assumption.

Prediction

\[ \hat{y_{0}}=x_{0}^{T} \hat{\beta} \] Since \(\operatorname{Var}\left(x_0^{\top} \hat{\beta}\right)=x_0^{\top}\left(X^{\top} X\right)^{-1} x_0 \sigma^2\), the prediction interval is \(\hat{y}_0 \pm t_{n-p}^{(\alpha / 2)} \hat{\sigma} \sqrt{1+x_0^T\left(X^T X\right)^{-1} x_0}\).
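
A minimal sketch of these tests and of the prediction interval, assuming simulated data (the new point x0 is purely illustrative):

```python
import numpy as np
from scipy import stats

# Simulated data and OLS fit.
rng = np.random.default_rng(1)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.4, size=n)
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p))
se = sigma_hat * np.sqrt(np.diag(XtX_inv))

# t tests of H0: beta_i = 0, and 95% confidence intervals.
t_stats = beta_hat / se
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - p)
t_crit = stats.t.ppf(0.975, df=n - p)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

# Prediction interval at a new point x0.
x0 = np.array([1.0, 0.5, -1.0])
y0_hat = x0 @ beta_hat
half = t_crit * sigma_hat * np.sqrt(1 + x0 @ XtX_inv @ x0)
print(t_stats, p_vals, ci, (y0_hat - half, y0_hat + half))
```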

Checking the Error Assumptions

The error assumptions to check are independence, constant variance, and normality.

  • Plot the residuals against the fitted values and choose a transformation of y based on the shape of the plot.
  • In practice, you need to look at the plot of the residuals and fitted values and take a guess at the relationship. When looking at the plot, we see the change in \(SD(y)\) rather than \(\operatorname{var} y\), because the SD is in the units of the response. If your initial guess is wrong, you will find that the diagnostic plot in the transformed scale is unsatisfactory. You can simply try another transformation - some experimentation is sensible. (From Linear Regression with R, P77)
  • A square-root transformation is useful for count data, because the Poisson distribution has equal mean and variance.

Normality

  • QQ-plot: residuals against the quantiles \(\Phi^{-1}\left(\frac{i}{n+1}\right)\)
  • Shapiro-Wilk test
    • \(W = \frac{\left(\sum_i a_i x_{(i)}\right)^2}{\sum_i\left(x_i-\bar x\right)^2}\), where \(x_{(i)}\) are the order statistics

Correlated Errors

  • The Durbin-Watson test uses the statistic (a computational sketch follows this list): \[D W=\frac{\sum_{i=2}^n\left(\hat{\varepsilon}_i-\hat{\varepsilon}_{i-1}\right)^2}{\sum_{i=1}^n \hat{\varepsilon}_i^2}\] The null hypothesis is that the errors are uncorrelated.
  • If DW is approximately 2, there is no autocorrelation.
  • If 0 < DW < 1, there is positive autocorrelation.
  • If 3 < DW < 4, there is negative autocorrelation.
  • Remedy: include additional features; the data may be a time series.
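
A minimal sketch of the Durbin-Watson statistic, assuming `resid` holds the OLS residuals in observation (e.g. time) order:

```python
import numpy as np

def durbin_watson(resid: np.ndarray) -> float:
    """DW = sum of squared successive differences over the residual sum of squares."""
    diff = np.diff(resid)
    return float(diff @ diff / (resid @ resid))

# Values near 2 suggest no autocorrelation; values well below 2 suggest positive
# autocorrelation, values well above 2 suggest negative autocorrelation.
```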

Checking Error Homoskedasticity

Heteroskedasticity does not affect the unbiasedness of the parameter estimates, but the estimates are no longer BLUE, and the standard errors used in the test statistics (the denominator of the t test) are no longer unbiased: \(Var(\hat\beta_i)\) is inaccurate, the variance is underestimated, and so the length of the confidence intervals is underestimated as well.

  1. White Test
  2. Residual plot (residuals against the fitted values or against X)

Remedies for violations (heteroskedasticity):

  1. Replace the standard errors with White's (heteroskedasticity-robust) standard errors, as sketched below
  2. Transformation, when the residuals show a clear trend
    1. Box-Cox transformation: replace y with \(f(y)=\frac{y^\lambda-1}{\lambda}\); for \(\lambda=0\), use \(\log y\).
    2. Choose \(\lambda\) so that the heteroskedasticity is minimized, i.e., the variance is as close to constant as possible.
  3. WLS (Weighted Least Squares)
  4. Note: under autocorrelated errors, the variance of \(\hat\beta\) is also underestimated.
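
A minimal sketch of White's heteroskedasticity-robust standard errors via the HC0 sandwich estimator \((X^{\top} X)^{-1} X^{\top} \operatorname{diag}(\hat{\varepsilon}_i^2) X (X^{\top} X)^{-1}\); `X` and `resid` are assumed to come from an OLS fit:

```python
import numpy as np

def white_se(X: np.ndarray, resid: np.ndarray) -> np.ndarray:
    """HC0 sandwich standard errors for the OLS coefficients."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (resid[:, None] ** 2 * X)   # X' diag(e^2) X
    cov = XtX_inv @ meat @ XtX_inv
    return np.sqrt(np.diag(cov))
```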

Unusual Observations

Leverage

\(h_i=H_{i i}\) are called leverages and are useful diagnostics. Since var \(\hat{\varepsilon}_i=\sigma^2\left(1-h_i\right)\), a large leverage, \(h_i\), will make var \(\hat{\varepsilon}_i\) small. The fit will be attracted toward \(y_i\). Large values of \(h_i\) are due to extreme values in the \(X\)-space. \(h_i\) corresponds to a (squared) Mahalanobis distance defined by \(X\) which is \((x-\bar{x})^T \hat{\Sigma}^{-1}(x-\bar{x})\) where \(\hat{\Sigma}\) is the estimated covariance of \(X\). The value of \(h_i\) depends only on \(X\) and not \(y\) so leverages contain only partial information about a case.

Since \(\sum_i h_i=p\), an average value for \(h_i\) is \(p / n\). A rough rule is that leverages of more than \(2 p / n\) should be looked at more closely.
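
A minimal sketch that computes the leverages and applies the 2p/n rule of thumb; `X` is assumed to be the model matrix:

```python
import numpy as np

def leverages(X: np.ndarray) -> np.ndarray:
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

def high_leverage(X: np.ndarray) -> np.ndarray:
    """Indices of cases with leverage above 2p/n, worth a closer look."""
    n, p = X.shape
    return np.where(leverages(X) > 2 * p / n)[0]
```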

Or one can use half-normal plots. The steps are: 1. Sort the data: \(x_{[1]} \leq \ldots \leq x_{[n]}\). 2. Compute \(u_i=\Phi^{-1}\left(\frac{n+i}{2 n+1}\right)\). 3. Plot \(x_{[i]}\) against \(u_i\).

Outlier

An outlier is a point that does not fit the current model well.

  1. To detect such points, we exclude point \(i\) and recompute the estimates to get \(\hat{\beta}_{(i)}\) and \(\hat{\sigma}_{(i)}^2\) where \((i)\) denotes that the \(i^{t h}\) case has been excluded. Hence:\(\hat{y}_{(i)}=x_i^T \hat{\beta}_{(i)}\).

If \(\hat{y}_{(i)}-y_i\) is large, then case \(i\) is an outlier. To judge the size of a potential outlier, we need an appropriate scaling. We find: \[ \widehat{\operatorname{var}}\left(y_i-\hat{y}_{(i)}\right)=\hat{\sigma}_{(i)}^2\left(1+x_i^T\left(X_{(i)}^T X_{(i)}\right)^{-1} x_i\right) \] and so we define the studentized (sometimes called jackknife or cross-validated) residuals as: \[ t_i=\frac{y_i-\hat{y}_{(i)}}{\hat{\sigma}_{(i)}\left(1+x_i^T\left(X_{(i)}^T X_{(i)}\right)^{-1} x_i\right)^{1 / 2}} \] which are distributed \(t_{n-p-1}\) if the model is correct and \(\varepsilon \sim N\left(0, \sigma^2 I\right)\). Fortunately, there is an easier way to compute \(t_i\): \[ t_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)} \sqrt{1-h_i}}=r_i\left(\frac{n-p-1}{n-p-r_i^2}\right)^{1 / 2} \] which avoids doing \(n\) regressions. Since \(t_i \sim t_{n-p-1}\), we can calculate a \(p\)-value to test whether case \(i\) is an outlier.
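
A minimal sketch of the jackknife (studentized) residuals using the shortcut formula above, so no case-deleted regressions are needed; `X`, `y` are assumed to be the model matrix and response:

```python
import numpy as np
from scipy import stats

def studentized_residuals(X: np.ndarray, y: np.ndarray):
    """Jackknife residuals t_i and two-sided p-values (before any Bonferroni correction)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    resid = y - H @ y
    sigma_hat = np.sqrt(resid @ resid / (n - p))
    r = resid / (sigma_hat * np.sqrt(1 - h))            # standardized residuals
    t = r * np.sqrt((n - p - 1) / (n - p - r ** 2))     # shortcut formula
    p_vals = 2 * stats.t.sf(np.abs(t), df=n - p - 1)
    return t, p_vals
```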

  2. Bonferroni correction: Suppose we want a level \(\alpha\) test. Now \(P(\text{all tests accept})=1-P(\text{at least one rejects}) \geq 1-\sum_i P(\text{test } i \text{ rejects})=1-n \alpha\). So this suggests that if an overall level \(\alpha\) test is required, then a level \(\alpha / n\) should be used in each of the tests.

  3. Cook's distance: measures the sensitivity of the fitted values in a regression to dropping a single observation \(j\): \[ D_{j}=\frac{\sum_{i=1}^{n}\left(\hat{Y}_{i}^{-j}-\hat{Y}_{i}\right)^{2}}{p S^{2}} \] where \(S^2\) is the estimate of the error variance from the model using all observations. If Cook's distance > 1, then observation \(j\) has a large impact.

Influential Observations

An influential point is one whose removal from the dataset would cause a large change in the fit.

  1. Consider the change in the coefficients \(\hat{\beta}-\hat{\beta}_{(i)}\).
  2. Cook statistic against half-normal quantiles. The Cook statistics: \[ \begin{aligned} D_i &=\frac{\left(\hat{y}-\hat{y}_{(i)}\right)^T\left(\hat{y}-\hat{y}_{(i)}\right)}{p \hat{\sigma}^2} \\ &=\frac{1}{p} r_i^2 \frac{h_i}{1-h_i} \end{aligned} \]
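
A minimal sketch of the Cook statistics using the closed form \(D_i=\frac{1}{p} r_i^2 \frac{h_i}{1-h_i}\) above; `X`, `y` assumed as before:

```python
import numpy as np

def cooks_distance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Cook statistics D_i = r_i^2 h_i / (p (1 - h_i))."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    resid = y - H @ y
    sigma2_hat = resid @ resid / (n - p)
    r = resid / np.sqrt(sigma2_hat * (1 - h))   # standardized residuals
    return r ** 2 * h / (p * (1 - h))
```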

Problems with Predictors

  1. Rescaling x changes neither the test statistics nor the estimate of σ.

Selecting Predictors

Note: adding irrelevant predictors increases \(R^2\) but decreases \(Adj.R^2\).

  1. General-to-specific model selection:
    • Start with all the variables, then repeatedly delete the one with the smallest t statistic until no insignificant variables remain.
  2. M-fold cross-validation

Multicollinearity

Because of collinear variables, it can happen that the t statistics of two variables are both insignificant, yet after one of them is removed the other becomes significant. As long as the predictors are not perfectly dependent, the BLUE property is not affected. Note: linear independence does not mean the basis vectors are orthogonal.

\(VIF_i(x_i)=\frac{1}{1-R_i^2}\)

  • VIF < 5 , OK ; VIF > 10 , collinearity.

  • feature selection

  • Combine the collinear variables into a single predictor

  • ridge regression

  • Perfect collinearity: \(X^{\top} X\) has no inverse; remove some of the variables.

  • Collinearity makes the estimate of β imprecise. The signs of the coefficients can be the opposite of what intuition about the effect of the predictor might suggest. The standard errors are inflated so that t-tests may fail to reveal significant factors; significance tests lose their power.

    • Imprecise estimate of \(\beta\)
    • \(t\)-test fails to reveal significant predictors
    • Sensitivity to measurement errors
    • Numerical instability
  • How to detect collinearity (a sketch of the second and third checks follows this list):

    1. Inspect the correlation matrix of the predictors.
    2. A regression of \(x_i\) on all other predictors gives \(R_i^2 . R_i^2\) close to one indicates a problem.
    3. Examine the eigenvalues of \(X^T X, \lambda_1 \geq \cdots \geq \lambda_p \geq 0\). Zero eigenvalues denote exact collinearity while the presence of some small eigenvalues indicates multicollinearity. The condition number \(\kappa\) measures the relative sizes of the eigenvalues and is defined as:\(\kappa=\sqrt{\frac{\lambda_1}{\lambda_p}}\), where \(\kappa \geq 30\) is considered large.
    4. The effect of collinearity can be seen by this expression for \(\operatorname{var} \hat{\beta}_j\) :\(\operatorname{var} \hat{\beta}_j=\sigma^2\left(\frac{1}{1-R_j^2}\right) \frac{1}{\sum_i\left(x_{i j}-\bar{x}_j\right)^2}\). We can see that if the predictor \(x_j\) does not vary much, then the variance of \(\hat{\beta}_j\) will be large. If \(R_j^2\) is close to one, then the variance inflation factor \(\left(1-R_j^2\right)^{-1}\) will be large and so var \(\hat{\beta}_j\) will also be large.
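
A minimal sketch of these checks: VIFs from regressing each predictor on the others, and the condition number from the eigenvalues of \(X^T X\). Here `X` is assumed to contain only the predictors (no intercept column):

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factors 1 / (1 - R_j^2) for each column of X."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2 = 1 - resid @ resid / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def condition_number(X: np.ndarray) -> float:
    """kappa = sqrt(lambda_max / lambda_min) of X'X; >= 30 is considered large."""
    eig = np.linalg.eigvalsh(X.T @ X)   # ascending order
    return float(np.sqrt(eig[-1] / eig[0]))
```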

Problems with Errors

Errors are correlated: https://math.stackexchange.com/questions/2957686/explain-about-the-correlation-of-error-terms-in-linear-regression-models. In short, if the errors are correlated, the estimated confidence intervals will be narrower than they should be.

Errors are dependent: GLS

We have assumed that \(\operatorname{var} \varepsilon=\sigma^2 I\). Suppose instead that \(\operatorname{var} \varepsilon=\sigma^2 \Sigma\) where \(\sigma^2\) is unknown but \(\Sigma\) is known - in other words, we know the correlation and relative variance between the errors, but we do not know the absolute scale of the variation.

Write \(\Sigma=S S^T\), where \(S\) is a triangular matrix obtained from the Cholesky decomposition. Then \[ \begin{aligned} S^{-1} y &=S^{-1} X \beta+S^{-1} \varepsilon \\ y^{\prime} &=X^{\prime} \beta+\varepsilon^{\prime} \end{aligned} \] Now we find that:

\[ \operatorname{var} \varepsilon^{\prime}=\operatorname{var}\left(S^{-1} \varepsilon\right)=S^{-1}(\operatorname{var} \varepsilon) S^{-T}=S^{-1} \sigma^2 S S^T S^{-T}=\sigma^2 I \] So we can reduce GLS to ordinary least squares (OLS) by a regression of \(y^{\prime}=S^{-1} y\) on \(X^{\prime}=S^{-1} X\) which has error \(\varepsilon^{\prime}=S^{-1} \varepsilon\) that is i.i.d.: \[ \hat{\beta}=\left(X^T \Sigma^{-1} X\right)^{-1} X^T \Sigma^{-1} y \]

  • Block effect: compound symmetry assumption
  • Spatial data
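
A minimal sketch of GLS via the Cholesky factor, following the derivation above; \(\Sigma\) is assumed known (up to the scale \(\sigma^2\)):

```python
import numpy as np

def gls(X: np.ndarray, y: np.ndarray, Sigma: np.ndarray) -> np.ndarray:
    """GLS estimate: OLS after transforming by S^{-1}, where Sigma = S S^T."""
    S = np.linalg.cholesky(Sigma)            # lower triangular factor
    Xp = np.linalg.solve(S, X)               # X' = S^{-1} X
    yp = np.linalg.solve(S, y)               # y' = S^{-1} y
    beta_hat, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    return beta_hat                          # equals (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
```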

Independent but not identically distributed: WLS

Levene test: checks homogeneity of variance and does not rely on the normality assumption.

Sometimes the errors are uncorrelated, but have unequal variance where the form of the inequality is known. In such cases, \(\Sigma\) is diagonal but the entries are not equal. Weighted least squares (WLS) is a special case of GLS and can be used in this situation. We set \(\Sigma=\operatorname{diag}\left(1 / w_1, \ldots, 1 / w_n\right)\), where the \(w_i\) are the weights, so \(S=\operatorname{diag}\left(\sqrt{1 / w_1}, \ldots, \sqrt{1 / w_n}\right)\). We then regress \(\sqrt{w_i} y_i\) on \(\sqrt{w_i} x_i\) (although the column of ones in the \(X\)-matrix needs to be replaced with \(\sqrt{w_i}\)). When weights are used, the residuals must be modified to use \(\sqrt{w_i} \hat{\varepsilon}_i\) (a short computational sketch follows the examples). Some examples:

  1. Errors proportional to a predictor: \(\operatorname{var}\left(\varepsilon_i\right) \propto x_i\) suggests \(w_i=x_i^{-1}\). One might choose this option after observing a positive relationship in a plot of \(\left|\hat{\varepsilon}_i\right|\) against \(x_i\).
  2. When the \(Y_i\) are the averages of \(n_i\) observations, then \(\operatorname{var} y_i=\operatorname{var} \varepsilon_i=\sigma^2 / n_i\), which suggests \(w_i=n_i\). Responses that are averages arise quite commonly, but take care that the variance in the response really is proportional to the group size.
  3. When the observed responses are known to be of varying quality, weights may be assigned \(w_i=1 / \operatorname{sd}\left(y_i\right)\).
  4. Additionally, the constrained least squares problem is suitable for regression with constraints on the coefficients.
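
A minimal sketch of WLS by rescaling rows with \(\sqrt{w_i}\), as described above; `w` is the vector of known weights (e.g. \(w_i=n_i\) when \(y_i\) is an average of \(n_i\) observations):

```python
import numpy as np

def wls(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weighted least squares: regress sqrt(w_i) * y_i on sqrt(w_i) * x_i."""
    sw = np.sqrt(w)
    beta_hat, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta_hat
```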

Not Normal: Robust Regression

Jarque-Bera test: a goodness-of-fit test of whether the sample data have skewness and kurtosis matching a normal distribution.

When the errors are normally distributed, least squares regression is best. Short-tailed errors are not so much of a problem but long-tailed error distributions can cause difficulties because a few extreme cases can have a large effect on the fitted model.

M-estimation

M-estimates modify the least squares idea to choose \(\beta\) to minimize: \[ \sum_{i=1}^n \rho\left(y_i-x_i^T \beta\right) \] Some possible choices among many for \(\rho\) are:

  1. \(\rho(x)=x^2\) is simply least squares.
  2. \(\rho(x)=|x|\) is called least absolute deviation (LAD) regression or \(L_1\) regression.
  3. \[ \rho(x)=\left\{\begin{array}{cc} x^2 / 2 & \text { if }|x| \leq c \\ c|x|-c^2 / 2 & \text { otherwise } \end{array}\right. \] is called Huber's method and is a compromise between least squares and LAD regression. \(c\) should be a robust estimate of \(\sigma\); a value proportional to the median of \(|\hat{\varepsilon}|\) is suitable.

M-estimation is related to weighted least squares. The normal equations tell us that: \[ X^T(y-X \hat{\beta})=0 \] With weights and in non-matrix form this becomes: \[ \sum_{i=1}^n w_i x_{i j}\left(y_i-\sum_{j=1}^p x_{i j} \beta_j\right)=0 \quad j=1, \ldots, p \] Now differentiating the M-estimate criterion with respect to \(\beta_j\) and setting it to zero we get: \[ \sum_{i=1}^n \rho^{\prime}\left(y_i-\sum_{j=1}^p x_{i j} \beta_j\right) x_{i j}=0 \quad j=1, \ldots, p \] Now let \(u_i=y_i-\sum_{j=1}^p x_{i j} \beta_j\) to get: \[ \sum_{i=1}^n \frac{\rho^{\prime}\left(u_i\right)}{u_i} x_{i j}\left(y_i-\sum_{j=1}^p x_{i j} \beta_j\right)=0 \quad j=1, \ldots, p \] so we can make the identification of a weight function as \(w(u)=\rho^{\prime}(u) / u\). We find for our choices of \(\rho\) above that:

  1. LS: \(w(u)\) is constant and the estimator is simply ordinary least squares.
  2. LAD: \(w(u)=1 /|u|\). We see how the weight goes down as \(u\) moves away from zero so that more extreme observations get downweighted. Unfortunately, there is an asymptote at zero. This makes a weighting approach to fitting an LAD regression infeasible without some modification.
  3. Huber: \[ w(u)=\left\{\begin{array}{cc} 1 & \text { if }|u| \leq c \\ c /|u| & \text { otherwise } \end{array}\right. \] This sensibly combines the downweighting of extreme cases with equal weighting for the middle cases.

There are many other choices for \(\rho\) that have been proposed. Computing an M-estimate requires some iteration because the weights depend on the residuals. The fitting methods alternate between fitting a WLS and recomputing the weights based on the residuals until convergence. We can get standard errors via WLS by \(\widehat{\operatorname{var}}(\hat{\beta})=\hat{\sigma}^2\left(X^T W X\right)^{-1}\), but we need to use a robust estimate of \(\sigma^2\).
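
A minimal sketch of Huber M-estimation via iteratively reweighted least squares, alternating a WLS fit with the weight update \(w(u)\) above. The tuning constant c = 1.345 and the MAD-based scale estimate are common conventions, not taken from these notes:

```python
import numpy as np

def huber_irls(X: np.ndarray, y: np.ndarray, c: float = 1.345,
               tol: float = 1e-8, max_iter: int = 100) -> np.ndarray:
    """Huber M-estimate of beta by iteratively reweighted least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # start from OLS
    for _ in range(max_iter):
        u = y - X @ beta
        scale = np.median(np.abs(u)) / 0.6745 + 1e-12     # robust scale estimate (MAD convention)
        a = np.abs(u) / scale
        w = np.where(a <= c, 1.0, c / a)                  # Huber weights w(u)
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```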

Least Trimmed Squares

LTS minimizes the sum of squares of the \(q\) smallest residuals, \(\sum_{i=1}^q \hat{\varepsilon}_{(i)}^2\) where \(q\) is some number less than \(n\) and (i) indicates sorting. This method has a high breakdown point because it can tolerate a large number of outliers depending on how \(q\) is chosen.

Transformation

Box-Cox transformation

The Box-Cox method is designed for strictly positive responses and chooses the transformation to find the best fit to the data. The method transforms the response \(y \rightarrow g_\lambda(y)\) where the family of transformations indexed by \(\lambda\) is: \[ g_\lambda(y)= \begin{cases}\frac{y^\lambda-1}{\lambda} & \lambda \neq 0 \\ \log y & \lambda=0\end{cases} \] For fixed \(y>0, g_\lambda(y)\) is continuous in \(\lambda\). Choose \(\lambda\) using maximum likelihood. The profile log-likelihood assuming normality of the errors is: \[ L(\lambda)=-\frac{n}{2} \log \left(\operatorname{RSS}_\lambda / n\right)+(\lambda-1) \sum \log y_i \] where \(\operatorname{RSS}_\lambda\) is the residual sum of squares when \(g_\lambda(y)\) is the response. You can compute \(\hat{\lambda}\) numerically to maximize this. One way to check whether a transformation is necessary is to form a confidence interval for \(\lambda\). A \(100(1-\alpha) \%\) confidence interval for \(\lambda\) is: \[ \left\{\lambda: \quad L(\lambda)>L(\hat{\lambda})-\frac{1}{2} \chi_1^{2(1-\alpha)}\right\} \] This interval can be derived by inverting the likelihood ratio test of the hypothesis that \(H_0: \lambda=\lambda_0\) which uses the statistic \(2\left(L(\hat{\lambda})-L\left(\lambda_0\right)\right)\) having approximate null distribution \(\chi_1^2\). The confidence interval also tells you how much it is reasonable to round \(\lambda\).
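
A minimal sketch that evaluates the profile log-likelihood \(L(\lambda)\) above on a grid and picks the maximizer; `X`, `y` are assumed given with \(y>0\):

```python
import numpy as np

def boxcox_profile(X: np.ndarray, y: np.ndarray, lambdas: np.ndarray) -> np.ndarray:
    """Profile log-likelihood L(lambda) = -n/2 log(RSS_lambda / n) + (lambda - 1) * sum(log y)."""
    n = len(y)
    sum_log_y = np.sum(np.log(y))
    logL = np.empty(len(lambdas))
    for k, lam in enumerate(lambdas):
        g = np.log(y) if lam == 0 else (y ** lam - 1) / lam
        coef, *_ = np.linalg.lstsq(X, g, rcond=None)
        rss = np.sum((g - X @ coef) ** 2)
        logL[k] = -n / 2 * np.log(rss / n) + (lam - 1) * sum_log_y
    return logL

# Example usage:
# lambdas = np.linspace(-2, 2, 81)
# lam_hat = lambdas[np.argmax(boxcox_profile(X, y, lambdas))]
```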

Some general considerations concerning the Box-Cox method are: 1. The Box-Cox method gets upset by outliers - if you find \(\hat{\lambda}=5\), then this is probably the reason - there can be little justification for actually making such an extreme transformation. 2. If some \(y_i<0\), we can add a constant to all the \(y\). This can work provided the constant is small, but this is an inelegant solution. 3. If \(\max _i y_i / \min _i y_i\) is small, then the Box-Cox will not have much real effect because power transforms are well approximated by linear transformations over short intervals far from the origin. 4. There is some doubt whether the estimation of \(\lambda\) counts as an extra parameter to be considered in the degrees of freedom. This is a difficult question since \(\lambda\) is not a linear parameter and its estimation is not part of the least squares fit.

The Box-Cox method is not the only way of transforming the response. Another family of transformations is given by \(g_\alpha(y)=\log (y+\alpha)\).

For responses that are proportions (or percentages), the logit transformation, \(\log (y /(1-y))\), is often used, while for responses that are correlations, Fisher's z transform, \(y=0.5 \log ((1+y) /(1-y))\), is worth considering.

Other Methods

  1. Broken Stick Regression
  2. Polynomials
  3. Splines

Model Selection

Note that model selection is sensitive to outliers.

Test Based

Backward Elimination

  1. Start with all the predictors in the model
  2. Remove the predictor with the highest \(p\)-value greater than \(\alpha\)
  3. Refit the model and go to step 2
  4. Stop when all \(p\)-values are less than \(\alpha\). An \(\alpha\) larger than 0.05 may be better if prediction is the goal.

Forward Selection

  1. Start with no predictor variables
  2. For all predictors not in the model, check the \(p\)-value if they are added to the model
  3. Add the one with the smallest \(p\)-value less than \(\alpha\)
  4. Refit the model and go to step 2
  5. Stop when no new predictors can be added

Stepwise regression

It is a combination of backward elimination and forward selection (variables may be added back after they have been removed).

Criterion Based

General idea: choose the model that optimizes a criterion which balances goodness-of-fit and model size.

AIC and BIC

  • Akaike information criterion (AIC)\[\mathrm{AIC}=n \ln (\mathrm{RSS} / n)+2(p+1)\]

  • Bayes information criterion (BIC)

\[ \mathrm{BIC}=n \ln (\mathrm{RSS} / n)+(p+1) \ln n \] Pick the model that minimizes AIC or BIC.
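
A minimal sketch computing both criteria from the residual sum of squares, with p predictors plus an intercept as in the formulas above:

```python
import numpy as np

def aic_bic(rss: float, n: int, p: int) -> tuple:
    """AIC and BIC as defined above; smaller is better."""
    aic = n * np.log(rss / n) + 2 * (p + 1)
    bic = n * np.log(rss / n) + (p + 1) * np.log(n)
    return aic, bic
```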

Adjusted \(R^2\)

  • Adding a predictor will not necessarily increase the adjusted \(R^2\)
  • Maximizing \(R_a^2\) is equivalent to minimizing \(\hat{\sigma}^2\)

Mallows' \(\mathrm{C}_{\mathrm{p}}\)

  • Definition: \(C_p=\frac{R S S_p}{\hat{\sigma}^2}+2(p+1)-n\)
  • \(\hat{\sigma}^2\) is estimated from the model with all predictors
  • \(R S S_p\) is from the model with \(p\) predictors
  • Goal: minimize \(C_p\).

Shrinkage Methods

  • PCA
  • Partial Least Squares
  • Ridge Regression
  • Bias - Variance Trade-off \[ \mathbf{E}(\hat{\mathbf{y}}-\mathbf{E}(\mathbf{y}))^2={(E(\hat{y})-E(y))^2}+E(\hat{y}-E(\hat{y}))^2 \]
  • Lasso Regression

Factor Model

Linear model analysis for one-factor models can be interpreted as comparing the within-group and between-group variances.

Welch t test:

\[ t=\frac{m_A-m_B}{\sqrt{\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}}} \]

  1. \(s_A\) and \(s_B\) are the standard deviations of groups \(A\) and \(B\). The degrees of freedom for Welch's t-test are estimated as \[ d f=\left(\frac{s_A^2}{n_A}+\frac{s_B^2}{n_B}\right)^2 /\left(\frac{s_A^4}{n_A^2\left(n_A-1\right)}+\frac{s_B^4}{n_B^2\left(n_B-1\right)}\right) \]
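
A minimal sketch of Welch's t statistic and its approximate degrees of freedom from the formulas above (scipy.stats.ttest_ind with equal_var=False gives the same test):

```python
import numpy as np

def welch_t(a: np.ndarray, b: np.ndarray):
    """Welch's t statistic and its approximate (Welch-Satterthwaite) degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = a.var(ddof=1), b.var(ddof=1)
    t = (a.mean() - b.mean()) / np.sqrt(va / na + vb / nb)
    df = (va / na + vb / nb) ** 2 / (
        va ** 2 / (na ** 2 * (na - 1)) + vb ** 2 / (nb ** 2 * (nb - 1))
    )
    return t, df
```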

Analysis steps:

  • Load the data
  • Check normality with a Q-Q plot
  • Check homogeneity of variance
  • If the variances are equal, use Student's t-test; if not, use Welch's t-test

Pairwise Comparisons

Pairwise comparison: A pairwise comparison of level \(i\) and \(j\) can be made using a CI for \(\alpha_i-\alpha_j\) using: \[\hat{\alpha}_i-\hat{\alpha}_j \pm t_{d f}^{\alpha / 2} \operatorname{se}\left(\hat{\alpha}_i-\hat{\alpha}_j\right)\] where \(\operatorname{se}\left(\hat{\alpha}_i-\hat{\alpha}_j\right)=\hat{\sigma} \sqrt{1 / J_i+1 / J_j}\) and \(\mathrm{df}=n-I\) in this case. A test for \(\alpha_i=\alpha_j\) amounts to seeing whether zero lies in this interval or not.

Multiple comparison

Tukey's honest significant difference (HSD): It depends on the studentized range distribution, which arises as follows. Let \(X_1, \ldots, X_n\) be i.i.d. \(N\left(\mu, \sigma^2\right)\) and let \(R=\max _i X_i-\min _i X_i\) be the range. Then \(R / \hat{\sigma}\) has the studentized range distribution \(q_{n, \nu}\), where \(\nu\) is the number of degrees of freedom used in estimating \(\sigma\). The Tukey CIs are: \[\hat{\alpha}_i-\hat{\alpha}_j \pm \frac{q_{I, d f}}{\sqrt{2}} \hat{\sigma} \sqrt{1 / J_i+1 / J_j}\] When the level sample sizes \(J_i\) are very unequal, Tukey's HSD test may become too conservative. We compute the Tukey HSD bands for the B-A difference.

Note: when the variances are unequal, Tamhane's test can be used.

False Discovery Rate

Bonferroni Test