When taking an introductory course in regression analysis, the first model you’ll often encounter is linear regression. I’ve written out the key assumptions of linear regression here so that the main course of the post is easier to digest.
- The relationship between the input and output is linear: \(y := X\beta + \epsilon\).
- The matrix \(X\) has full column rank, so \(X^T X\) is invertible.
- The error term has zero mean and is uncorrelated with the input: \(\mathbb{E}[ \, \epsilon \, | \, X \,] = 0\).
  - Often one assumes the errors follow a Gaussian distribution. This assumption is not necessary for the Gauss-Markov theorem to hold, but it is commonly made when the output is a (scalar) continuous random variable.
- The error term has constant variance: \(\operatorname{Var}(\, \epsilon \, | \, X \,) = \sigma^2 \cdot \operatorname{I}_n\).
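To make the assumptions concrete, here is a minimal simulation of the assumed model (the sizes, seed, and coefficient values are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Design matrix with full column rank (almost surely, for Gaussian entries)
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5])      # true coefficients (arbitrary choice)
eps = rng.normal(scale=1.0, size=n)    # zero-mean, constant-variance errors

# The assumed linear model: y = X beta + eps
y = X @ beta + eps

# Full column rank means X^T X is invertible
assert np.linalg.matrix_rank(X) == p
```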
Now that we have the key assumptions in mind, we can progress to the main course.
When you fit a linear regression model in R or Python, the output is a vector of coefficients \(\hat{\beta}\) which are selected to minimize the error-sum-of-squares given by
\[ SSE(\beta) = e^T e = (y - X\beta)^T(y - X\beta), \] where \(e = y - X\beta\) is the residual vector. In the case of linear regression, there is an exact solution to this optimization problem which can be found using matrix calculus and linear algebra. This exact solution is famous enough that it goes by the name Ordinary Least Squares (OLS). If you need a refresher on the derivation of the OLS estimate, go here.
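As a sanity check of the claim that \(\hat{\beta}\) minimizes the SSE, here is a quick sketch on simulated data (data and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -0.7]) + rng.normal(size=n)

# Least-squares fit, as R's lm() or sklearn's LinearRegression would produce
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def sse(b):
    """Error sum of squares SSE(b) = (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return r @ r

# Perturbing beta_hat in any direction cannot decrease the SSE
assert sse(beta_hat) <= sse(beta_hat + 0.1)
assert sse(beta_hat) <= sse(beta_hat - 0.1)
```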
Now, when I first learned how to derive the OLS solution myself, it felt a bit mysterious. The use of matrix calculus seemed unnecessary to solve such a seemingly simple problem. Further, I’d often forget what the final solution for \(\hat{\beta}\) looked like due to the opaqueness of the derivation.
I like to keep things simple, so I was pleased to recently realize that there is an easier way to find the OLS solution, one that avoids matrix calculus altogether.
If we again consider the problem we’re trying to solve, we want the \(\hat{\beta}\) that, when plugged into our equation \(y_{obs} = X_{obs} \beta\), minimizes the error sum of squares. My moment of clarity came when I stopped considering the statistical part of the problem and focused on the task at hand: solving a linear equation for the unknown \(\beta\). Recognizing that one cannot invert \(X\), which is \(n\) by \(p\), the derivation becomes straightforward and is given below.
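In outline, the steps run roughly as follows: start from the assumed statistical model, drop the error term by appealing to its zero conditional mean, and then multiply both sides by \(X^T\) so that the resulting \(p \times p\) matrix can be inverted:

\[
\begin{aligned}
y &= X\beta + \epsilon \\
\mathbb{E}[\, y \mid X \,] &= X\beta + \mathbb{E}[\, \epsilon \mid X \,] \\
\mathbb{E}[\, y \mid X \,] &= X\beta \\
X^T y &= X^T X \beta \\
(X^T X)^{-1} X^T y &= (X^T X)^{-1} X^T X \beta \\
\hat{\beta} &= (X^T X)^{-1} X^T y
\end{aligned}
\]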
Okay, so what just happened?
I intentionally made the derivation a bit long, especially the first few lines where we start from the assumed statistical model, just so it is abundantly clear what is going on. Now, I’m going to redo it, this time keeping only the key steps in the derivation.
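Condensed to its key steps, the derivation is just:

\[
\begin{aligned}
y &= X\beta \\
X^T y &= X^T X \beta \\
\hat{\beta} &= (X^T X)^{-1} X^T y
\end{aligned}
\]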
Much better.
As the steps in red show, to arrive at the OLS solution, all we did was multiply both sides by \(X^T\) and solve for \(\beta\). This is completely valid, since we assume that \(X^T X\) is invertible. Compared to the usual OLS derivation, this is much cleaner and easier to re-derive at a moment’s notice.
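You can also confirm numerically that the "multiply by \(X^T\) and solve" trick reproduces the standard least-squares fit (again on made-up simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.3, 1.2, -2.0]) + rng.normal(size=50)

# The trick: multiply y = X beta by X^T, then solve the p x p system
beta_trick = np.linalg.solve(X.T @ X, X.T @ y)

# Standard least-squares solver for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_trick, beta_lstsq)
```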
Anyways, that’s all I’ve got for now.
I hope you found this post useful, and will use this trick as you continue learning about regression analysis!