Linear Regression as a System of Linear Equations
Published:
Linear regression is typically introduced as an optimization problem, but it can be understood more fundamentally as solving a linear system. This perspective clarifies when exact solutions exist and why the least-squares formula emerges naturally from maximum likelihood estimation under Gaussian noise.
Setup
Let $X \in \mathbb{R}^{n \times d}$, $\theta \in \mathbb{R}^{d \times 1}$, and $y \in \mathbb{R}^{n \times 1}$. We are interested in solving the linear system
\[X\theta = y.\qquad(1)\]This system may have zero, one, or infinitely many solutions. The structure is governed entirely by the rank of $X$ and by whether $y$ lies in its column space.
Existence and Uniqueness
A unique solution to $X\theta = y$ exists iff:
At least one solution:
There is at least one $\theta\in \mathbb{R}^d$ such that $X\theta = y$. In other words, $y$ lies in the column space of $X$, i.e. $\operatorname{rank}(X) = \operatorname{rank}([X \mid y]).$At most one solution: $X\phi = 0$ if and only if $\phi=0$, or equivalently, \(\operatorname{rank}(X) = d.\) To see why this ensures uniqueness, suppose $X\theta = X\theta’ = y$. Then $X(\theta - \theta’) = 0$, which implies $\theta - \theta’ = 0$ by the injectivity condition.
If both conditions hold—i.e., $\operatorname{rank}(X) = \operatorname{rank}([X \mid y]) = d$—then there is exactly one $\theta$ satisfying $X\theta = y$.
Closed-Form Solution
Multiply both sides of (1) by $X^T$ to obtain the normal equations:
\[X^TX\theta = X^Ty.\qquad (2)\]Note that $X^TX$ is a $d \times d$ square matrix. To determine when this system has a unique solution, we need to understand when $X^TX$ is invertible.
A key observation is that $\operatorname{ker}(X) = \operatorname{ker}(X^TX)$, which follows from the equivalence
\[X\theta = 0 \Leftrightarrow X^TX\theta = 0.\]To see the forward direction, note that if $X\theta = 0$, then $X^TX\theta = X^T \cdot 0 = 0$. For the reverse, if $X^TX\theta = 0$, then $\theta^T X^TX\theta = \|X\theta\|^2 = 0$, so $X\theta = 0$.
Since $\operatorname{ker}(X) = \operatorname{ker}(X^TX)$, we have $\operatorname{rank}(X) = \operatorname{rank}(X^TX)$. Therefore, if $\operatorname{rank}(X) = d$, then $X^TX$ has full rank and is invertible. In this case the solution of (2) is
\[\theta^* = (X^T X)^{-1} X^T y.\]This formula returns the exact $\theta$ whenever the system is consistent.
It applies cleanly in cases:
- $n > d$ (tall, consistent),
- $n = d$ (square full-rank),
- but not $n < d$ (since $\operatorname{rank}(X) \le n < d$).
To verify that $\theta^*$ satisfies $X\theta^* = y$ when the system is consistent, observe that if $y = X\theta^\dagger$ is the true solution for some $\theta^\dagger$, then
\[X\theta^* = X(X^TX)^{-1}X^Ty = X(X^TX)^{-1}X^TX\theta^\dagger = X\theta^\dagger = y.\]The Regression Setting
In regression we typically have $n \gg d$ and almost never have an exact solution because $y \notin \operatorname{col}(X)$. Instead, we assume a noise model:
\[y = X\theta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I).\]Note: The only source of uncertainty in this model is the Gaussian noise. Since $y = X\theta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, we have $y \sim \mathcal{N}(X\theta, \sigma^2 I)$. The negative log-likelihood is
\[-\log p(y \mid X, \theta, \sigma^2) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\|y - X\theta\|_2^2.\]Since $\sigma^2$ is a constant with respect to $\theta$, maximizing the log-likelihood (or minimizing the negative log-likelihood) over $\theta$ is equivalent to solving the least-squares problem
\[\min_\theta \|X\theta - y\|_2^2.\]Setting the gradient to zero yields the normal equations:
\[X^T X \theta = X^T y,\]and when $X$ has full column rank, the unique minimizer is
\[\hat{\theta} = (X^T X)^{-1} X^T y.\]In the inconsistent case this does not satisfy $X\theta = y$ exactly, but it is the closest possible solution in squared-error terms and is also the MLE.
