Applied Linear Statistical Methods

UChicago STAT 34300, Autumn 2023

Whenever it is unclear, vectors are column vectors.

Simple Linear Regression


The setup of the simplest model is as follows.

Take x_i \in \mathbb R as predictors (alternatively, independent variables, features, etc), and y_i \in \mathbb R as responses (alternatively, outcomes, etc.). Then, we hopes that the response is approximately linear in the predictors, e.g. y_i = \beta_0 + \beta_1 x_i + \epsilon_i which we split (semantically) into a systemic component and some error.

Our goal is then to minimize the residual sum of squares, e.g.  \mathop{\mathrm{RSS}}(\mathbf \beta) = n^{-1}\sum_{i=1}^n \epsilon_i^2 = n^{-1}\sum_{i=1}^n(y_i - \beta_0 - \beta_1 x_i)^2.

Do this however you want; it doesn’t matter. You will arrive at \begin{align*} \hat \beta_1 &= \frac{\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n(x_i - \bar x)^2} \\ \hat \beta_0 &= \bar y - \hat \beta_1 \bar x. \end{align*}

Now, none of the above has any particular randomness as defined. However, we can now impose (some of) the following assumptions:

Now for example, if we assume linearity, we can compute that \hat \beta_1 = \beta_1 + \widetilde \epsilon with \mathop{\mathrm{Var}}(\widetilde \epsilon) = \frac{\sum(x_i - \bar x)^2 \mathop{\mathrm{Var}}(\epsilon_i)}{\left(\sum(x_i - \bar x)^2 \right)^2} = \frac{\sigma^2}{\sum(x_i - \bar x)^2} = \sigma_{\bar x}^2 where the last equality only holds under homoskedasticity, and we call the last quantity the standard error.

In this case, we can compute that by the CLT the asymptotic distribution of \hat \beta_{1} is N(\beta_1, \sigma_{\bar x}^2). In particular, we can form a confidence interval by doing the usual stuff.

Testing proceeds the same way: set some null H_0: \beta_1 = c and some alternative H_A: \beta_1 \neq c, and set \delta = \begin{cases} 1 & \text{if } H_0 \text{ is rejected} \\ 0 & \text{otherwise} \\ \end{cases}.

Now take a test statistic T \sim F_0 under the null, with p = 1 - F_0(T); this immediately yields that under the null, p is uniformly distributed under the null (as long as F_0 is continuous), basically by definition. Then, we set some threshold for \alpha, the probability of a Type-I error, and set \delta = 1 \iff p \leq \alpha.

Even under misspecification, we can see that if our samples are i.i.d. distributed x_1, \dots, x_n \sim X and y_1, \dots, y_n \sim Y, \hat \beta_1 = \frac{n^{-1} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{n^{-1}\sum_{i=1}^n (x_i - \bar x)^2} \overset{D}{\to} \frac{\mathop{\mathrm{Cov}}(X, Y)}{\mathop{\mathrm{Var}}(X)} = \rho_{XY} \cdot \frac{\sigma_X}{\sigma_Y} where \rho_{XY} is the correlation and \sigma_X, \sigma_Y are standard deviations.

Multiple Linear Regression


Now we take x_i \in \mathbb{R}^p and y_i \in \mathbb{R}, and set our model to be y_i = \sum_{j=1}^p\beta_j (x_i)_j + \epsilon_i. There are many different features that one could use.

Notationally, set y = \left[\begin{matrix}y_1 & \dots & y_n\end{matrix}\right]^T \in \mathbb{R}^n, X = \left[\begin{matrix}x_1 & \dots & x_n\end{matrix}\right]^T \in \mathbb{R}^{n \times p} where each x_i is considered as column vector, \beta = \left[\begin{matrix} \beta_1 & \dots & \beta_n \end{matrix}\right] and \epsilon = \left[\begin{matrix} \epsilon_1 & \dots & \epsilon_n \end{matrix}\right]^T so that our model reduces to y = X\beta + \epsilon.

Then, the OLS estimator is given by \hat \beta = \mathop{\mathrm{argmin}}\{ \mathop{\mathrm{RSS}}(\beta) = \| y - X \beta \|^2\} which we can compute directly taking a derivative (so long as n \geq p, e.g. we have more observations than features): \nabla_\beta \mathop{\mathrm{RSS}}(\beta) = -2X^T(Y - X\beta) = 0 \implies X^TX\beta = X^Ty and if X is invertible, we get \hat \beta = (X^TX)^{-1}X^Ty.

Why might X not be invertible? Some examples are

OLS as a Projection

From a geometric perspective, the OLS predictions are just the projections of y into the image of X as a linear map \mathbb{R}^p \to \mathbb{R}^n.

Def: X_{\cdot, 1}, \dots, X_{\cdot p} are orthogonal if \left\langle X_{\cdot j}, X_{\cdot k}\right\rangle = 0 for all j, k, and orthonormal if they are all of length 1.

Theorem: Set V \subset \mathbb{R}^n a linear subspace and y \in \mathbb{R}^n; there exists a unique vector \mathop{\mathrm{proj}}_V(y) = \mathop{\mathrm{argmin}}_{v \in V} \| y - v \|. Furthermore, y - \mathop{\mathrm{proj}}_V(y) \in V^{\perp}, the perpendicular space to V (or equivalently, \left\langle y - \mathop{\mathrm{proj}}_V(y), v \right\rangle = 0 for all v \in V).

There are a few more facts:

Corollary: Set P_X = P_{\mathop{\mathrm{Im}}(X)}, \hat y = P_X y = X \hat \beta. This immediately yields that the “fit” \hat y and the residuals \hat \epsilon are unique, and \|y\|^2 = \| \hat y \|^2 + \| \hat \epsilon \|^2 since \hat \epsilon = (I - P_{X})y.


Reducing to Simple Linear Regression

If one column X_{\cdot j} is orthogonal to every other feature, then this reduces to simple linear regressions: \hat \beta_j = \frac{\left\langle X_{\cdot j}, y\right\rangle}{\| X_{\cdot j} \|^2}.

Using this, set

Then, we see that \hat \beta_p = \frac{\left\langle Z_{\cdot p}, y \right\rangle}{\| Z_{\cdot p}^2\|}, so in general each coefficient depends on other variables, e.g. \hat \beta_j = \frac{\left\langle \text{residuals of } X_{\cdot j} \sim \sum_{k \neq j}X_{\cdot k}, y\right\rangle}{ \| \text{residuals of } X_{\cdot j} \sim \sum_{k \neq j}X_{\cdot k} \|^2}.

In particular, the above (which is clearly just doing Gram-Schmidt) gives us a QR-decomposition of X, where Q = ZD is the orthonormalization of Z and R is the upper triangular matrix corresponding to Gram-Schmidt.

This is (for the most part) what software packages do in computing linear models: you set y = Q\gamma + \epsilon, and take \hat \gamma = Q^Ty; but since the fit is unique, we have that Q \hat \gamma = X \hat \beta = QR\hat \beta \implies R\hat \beta = \hat \gamma, which is numerically tractible.


Def: Link The SVD decomposition of a real (resp. complex) matrix X \in \mathbb{R}^{n \times p} (resp. \mathbb{C}^{n \times p}) is X = U \Sigma V^* where U, V are orthogonal (resp. unitary) and \Sigma is diagonal. Alternatively, we may sometimes use the “skinny” SVD, where U, \Sigma are just n \times p and p \times p matrices, essentially by dropping the zero rows of \Sigma.

SVD is related to eigenvalue decomposition by noting that X^*X = V\Sigma V^* U \Sigma V^* = V(\Sigma^* \Sigma)V^* so that columns of V are eigenvectors of X^*X, and similarly columns of U are eigenvectors of XX^*.

Then, applying to least squares, we have that \begin{align*} \| y - X \beta \|^2 &= \| y - U \Sigma V^T \beta \|^2 \\ &= \| U^Ty - \Sigma V^T B \| \\ &= \| U^Ty - \Sigma \beta^* \|^2 \\ &= \sum_{j=1}^p ((U^Ty)_j - \sigma_j \beta_j^*)^2 \\ &= \sum_{j=1}^k (U^Ty)_j - \sigma_j \beta_j^*)^2 + \sum_{j=k+1}^p (U^Ty)_j)^2 \\ \end{align*}

So minimizing with respect to \beta_j^*, we pick \hat \beta_j^* = \begin{cases} \frac{(U^Ty)_j}{\sigma_j} & j = 1, \dots, k \\ \text{anything} & j = k + 1, \dots, p \end{cases} and in particular if we take the arbitrary values to be 0 we get the “minimal norm” solution and since \beta^* = V^T \beta, the “ridgeless” solution is \hat \beta = V\beta^*, and is \mathop{\mathrm{argmin}}\{ \| \beta \| \mid \beta \in \mathbb{R}^p, X^TX\beta = X^Ty \}.


We covered a bunch of stuff that you can probably find on Wikipedia.

Inference in the Homoskedastic Normal Linear Model

For this section, set X \in \mathbb{R}^{n \times p} design matrix, which is taken to be deterministic and of full rank. Then, set y \sim N(X \beta, \sigma^2 I) for some \sigma > 0, \beta \in \mathbb{R}^p; that is, set Y = X\beta + \epsilon, \epsilon \sim N(0, \sigma^2 I). We have already seen \hat \beta = (X^TX)^{-1}X^TY and so \hat \beta is normal with mean \beta and variance \mathop{\mathrm{Var}}(\hat \beta) = (X^TX)^{-1}X^T(\sigma^2 I)X(X^TX)^{-1} = \sigma^2 (X^TX)^{-1}.

But now we need to understand \sigma: we may split Y into \hat Y = P_XY and \hat \epsilon = (I - P_X)Y, where P_X is the orthogonal projection matrix onto X; properties of the multivariate Gaussian imply that these two are independent, and have distributions \begin{align*} \hat Y &\sim N(X \beta, \sigma^2 P_X) \\ \hat \epsilon &\sim N(0, \sigma^2(I - P_X)) \end{align*}

which we can now use \hat \epsilon to learn about \sigma^2 and \hat Y to learn about \beta.

Specifically, we use the residual sum of squares RSS = \| \hat \epsilon \|^2 \sim \sigma^2 \chi^2_{n-p}; thus this has expectation \sigma^2(n - p), so we can set \hat \sigma^2 = \frac{\| \epsilon^2 \|}{n - p} \sim \frac{\sigma^2 \chi^2_{n - p}}{n - p}. Moreover, since X^T = X^TP_X, we already know that \hat \beta only depends on \hat y, and so is independent of \hat \sigma.


We can now do inference for \beta_j; in particular \hat \beta_j \sim N(\beta_j, \sigma^2 (X^TX)_{jj}^{-1}). Then we know that if we take \begin{align*} A &= \frac{\hat \beta_j - \beta_j}{\sigma \sqrt{(X^TX)^{-1}_jj}} \sim N(0, 1) \\ B &= \frac{\hat \sigma^2 }{\sigma^2} \sim \frac{\chi^2_{n-p}}{n - p} \end{align*} so A / \sqrt{B} \sim t_{n - p}; that is, \frac{\hat \beta_j - \beta_j}{\hat \sigma \sqrt{(X^TX)^{-1}_jj}} \sim t_{n - p} so we can get a concrete handle on the sampling distributions from only our observations.