In computer science class, we briefly touched on the idea of doing linear regression with matrices, but never actually learned how it worked. Instead, we only coded linear regression with gradient descent.
And so, on the plane to my winter destination, with some pure, unfiltered time on my hands, I decided to take a closer look.
This article is meant to be read with pencil and paper in hand (especially if these concepts are unfamiliar to you).
2. Establishing our regression formula
We’re only going to be considering the simplest function (a straight line)! Each point $(x_i, y_i)$ of our original scatter plot can be written as the value of the regressed line at $x_i$ plus an “error term” $\epsilon_i$, the point’s vertical distance from the line:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where $i = 1, \dots, n$ indexes our $n$ data points. Writing the equation out for each $i$, we obtain:

$$y_1 = \beta_0 + \beta_1 x_1 + \epsilon_1$$
$$y_2 = \beta_0 + \beta_1 x_2 + \epsilon_2$$
so on and so forth for each point. We can write this using matrix notation in a far more efficient way.
3. Properties of a matrix (addition, multiplication, etc.)
Matrix multiplication: I can give you a definition along the lines of “take the dot product of each row with each possible column,” but here is my somewhat less convoluted example:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \cdot \begin{bmatrix} e & f \\ g & h \end{bmatrix} = \begin{bmatrix} ae+bg & af+bh \\ ce+dg & cf+dh \end{bmatrix}$$
As for matrix addition, we simply add up the corresponding elements:
$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} + \begin{bmatrix} e & f \\ g & h \end{bmatrix} = \begin{bmatrix} a+e & b+f \\ c+g & d+h \end{bmatrix}$$
Next, the “transposition” of a matrix is simply a matrix with its rows and columns swapped. If an item is in row i and column j, it is now in row j and column i:
$$A = \begin{bmatrix} a & b & c \\ d & f & g \\ h & i & j \end{bmatrix}, \qquad A^T = \begin{bmatrix} a & d & h \\ b & f & i \\ c & g & j \end{bmatrix}$$

It is important to note that $(A^T)^T = A$.
Last, the identity matrix. It’s like the number 1: $AI = A$ for every $A$, where $I$ has 1s on the diagonal and 0s everywhere else.
The inverse matrix $A^{-1}$ is the matrix you multiply $A$ by to get the corresponding identity matrix: $AA^{-1} = A^{-1}A = I$. (Only square matrices can have inverses, and not all of them do.)
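The properties above can be checked directly in code. Here is a minimal sketch using NumPy (not part of the original article; the $2 \times 2$ matrices are made up purely for illustration):

```python
import numpy as np

# Two arbitrary 2x2 matrices to demonstrate the properties from this section.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

product = A @ B   # row-times-column matrix multiplication
total = A + B     # element-wise matrix addition
I = np.eye(2)     # the 2x2 identity matrix

print(np.allclose(A @ I, A))                  # A I = A      -> True
print(np.allclose(A.T.T, A))                  # (A^T)^T = A  -> True
print(np.allclose(A @ np.linalg.inv(A), I))   # A A^{-1} = I -> True
```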
4. Rewriting our collection of equations as matrices
Okay, now that our basics are covered, let’s go ahead and try and find some formulas! Stacking the equations from section 2, we can notice that:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

or, naming these matrices $Y$, $X$, $\beta$, and $\epsilon$:

$$Y = X\beta + \epsilon$$

We have a wonderfully short and simple statement that we can then manipulate!
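As a sketch (not from the original article, with made-up data), here is how $n$ points get stacked into these matrices with NumPy:

```python
import numpy as np

# Four made-up data points (x_i, y_i), purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

Y = y.reshape(-1, 1)                       # n x 1 column of responses
X = np.column_stack([np.ones_like(x), x])  # n x 2: a column of 1s, then x

print(X.shape, Y.shape)  # (4, 2) (4, 1)
```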
5. Finding a formula for the error given the coefficients
We now write a function for the error of prediction. Since we want to find the value of $\beta$ that gives the least error, we solve for $\epsilon$ and write it as a function of $\beta$:

$$e(\beta) = Y - X\beta$$
In class, we learned the function for mean squared error as:
$$\mathrm{MSE}(\beta) = \frac{1}{n}\sum_{i=1}^{n} e_i^2(\beta)$$
In terms of our matrices, multiplying a column vector’s transpose by the vector itself gives exactly $\sum_{i=1}^{n} e_i^2$, so we can write:

$$\mathrm{MSE}(\beta) = \frac{1}{n} e^T e$$
How wonderfully convenient! Writing this in terms of $X$ and $Y$ using $e(\beta) = Y - X\beta$, and recalling that $(AB)^T = B^T A^T$:

$$\frac{1}{n}(Y - X\beta)^T (Y - X\beta) = \frac{1}{n}\left(Y^T - \beta^T X^T\right)(Y - X\beta)$$
Expanding:
$$\frac{1}{n}\left(Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X \beta\right)$$
Since $Y^T X\beta$ is a $1 \times 1$ matrix (a scalar), it equals its own transpose: $Y^T X\beta = (Y^T X\beta)^T = \beta^T X^T Y$. So the two middle terms combine:

$$\mathrm{MSE}(\beta) = \frac{1}{n}\left(Y^T Y - 2\beta^T X^T Y + \beta^T X^T X \beta\right)$$
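To see that the matrix form really equals the plain sum of squares, here is a small check (a sketch, not from the article; the data and the deliberately non-optimal $\beta$ are made up):

```python
import numpy as np

# Made-up data and an arbitrary (not yet optimal) beta.
x = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([[2.1], [3.9], [6.2], [8.1]])
X = np.column_stack([np.ones_like(x), x])
beta = np.array([[0.5], [1.5]])

e = Y - X @ beta
n = len(Y)
mse_matrix = (e.T @ e).item() / n                 # (1/n) e^T e
mse_sum = sum(ei.item() ** 2 for ei in e) / n     # (1/n) sum of e_i^2

print(np.isclose(mse_matrix, mse_sum))  # True
```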
6. Finding the gradient
Since we want to find the values of β that minimize the mean squared error, we calculate the “slope” of the function and find the point at which the slope is 0. We take the “gradient” with respect to β.
More formally, the gradient of the function $f$ with respect to $x$ is the column vector of partial derivatives:

$$\nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
So, let’s find $\nabla \mathrm{MSE}(\beta)$. Since the gradient is linear, we can distribute it across the three terms:

$$\nabla \mathrm{MSE}(\beta) = \frac{1}{n}\left(\nabla (Y^T Y) - \nabla(2\beta^T X^T Y) + \nabla(\beta^T X^T X \beta)\right)$$
First piece
Since $Y^T Y$ does not depend on $\beta$, it is effectively a constant and its gradient is $0$.
Second piece
We can rewrite $\nabla(2\beta^T X^T Y)$ as $\nabla_\beta(\beta^T c)$ where $c = 2X^T Y$ is a constant vector. Using sum notation:

$$\nabla_\beta (\beta^T c) = \nabla_\beta \sum_{i=1}^{n} \beta_i c_i$$

For the $i$-th partial derivative, every term of the sum is constant except $\beta_i c_i$, and $\frac{\partial}{\partial \beta_i}(\beta_i c_i) = c_i$, so:

$$\nabla_\beta(\beta^T c) = c$$

Therefore: $\nabla(2\beta^T X^T Y) = 2X^T Y$.
Third piece
We let $A = X^T X$ and work with $f(\beta) = \beta^T A \beta = \sum_{i,j} \beta_i A_{ij} \beta_j$. Taking the partial derivative with respect to $\beta_k$, the only surviving terms are those with $i = k$ or $j = k$:

$$\frac{\partial f}{\partial \beta_k} = \sum_{j} A_{kj}\beta_j + \sum_{i} \beta_i A_{ik} = (A\beta)_k + (A^T\beta)_k$$

Since $A = X^T X$ is symmetric ($A_{ij} = A_{ji}$), the two halves are equal, giving:

$$\nabla(\beta^T X^T X \beta) = 2X^T X\beta$$
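If you don’t trust the algebra, the identity $\nabla(\beta^T A \beta) = 2A\beta$ can be checked numerically with central finite differences. This sketch is not from the article; $X$ and $\beta$ are arbitrary random values:

```python
import numpy as np

# Random X gives a symmetric A = X^T X by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
A = X.T @ X
beta = rng.normal(size=3)

f = lambda b: b @ A @ b        # f(beta) = beta^T A beta
analytic = 2 * A @ beta        # the gradient derived above

# Central-difference approximation of each partial derivative.
h = 1e-6
numeric = np.array([
    (f(beta + h * np.eye(3)[k]) - f(beta - h * np.eye(3)[k])) / (2 * h)
    for k in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```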
7. Finale
Putting it all together:
$$\nabla \mathrm{MSE}(\beta) = \frac{1}{n}\left(0 - 2X^T Y + 2X^T X\beta\right) = \frac{2}{n}\left(X^T X\beta - X^T Y\right)$$
At the optimum $\hat\beta$, this equals zero:

$$X^T X \hat\beta - X^T Y = 0 \implies \hat\beta = \left(X^T X\right)^{-1} X^T Y$$
Given a set of points formatted into matrices as we originally established, we can find good estimates for our coefficients!
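Here is the closed-form formula in action, cross-checked against NumPy’s built-in least-squares solver (a sketch with made-up data, not part of the original article):

```python
import numpy as np

# Made-up, roughly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])

X = np.column_stack([np.ones_like(x), x])

# beta_hat = (X^T X)^{-1} X^T Y, exactly as derived above.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check against NumPy's least-squares routine.
reference, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, reference))  # True
```

In practice, solvers like `np.linalg.lstsq` (or `np.linalg.solve` on the normal equations) are preferred over explicitly inverting $X^T X$, since they are more numerically stable; the formula above is best seen as the mathematical statement.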
8. Multivariable and nonlinear extensions
The multivariable and nonlinear extensions use the same formula. For a quadratic model:
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$$
We add a column for $x^2$ to $X$ and a row for $\beta_2$ to $\beta$:

$$X = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}$$
This $X$ matrix is called the design matrix. The column order is arbitrary; you just need to make sure the columns of $X$ correspond to the rows of $\beta$.
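The same closed-form formula handles the quadratic case once the design matrix has the extra column. A sketch (not from the article) using data generated from a known curve, so we can see the coefficients recovered:

```python
import numpy as np

# Data generated exactly from y = 1 + 2x + 0.5x^2 (no noise, for clarity).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 0.5 * x**2

# Design matrix with columns 1, x, x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Same formula as the linear case: (X^T X)^{-1} X^T Y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(np.round(beta_hat, 6))  # recovers [1, 2, 0.5]
```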
We can then go on to talk about properties of the hat matrix, fitted values, residuals, covariances, Gaussian noise, expectations, etc. — but that requires a stronger linear algebra prerequisite and will be saved for another article.