linear regression with matrices and gradients


1. Introduction

In computer science class, we briefly touched on the idea of doing linear regression with matrices, but never truly learned how it worked. Instead, we only coded linear regression with gradient descent.

And, on my flight to my winter destination, seeing as I had pure, unfiltered time, I decided to take a closer look.

This article is meant to be read with pencil and paper in hand (especially if these concepts are unfamiliar to you).

2. Establishing our regression formula

We’re only going to be considering the simplest function (a straight line)! Each point (x, y) of our original scatter plot can be written as the value of the regressed line plus an “error term.” The error term is each point’s distance from the line:

y_i = \beta_0 + \beta_1 x_i + \epsilon_i

where i = 1, \ldots, n. Writing out the equation for each i, we obtain:

y_1 = \beta_0 + \beta_1 x_1 + \epsilon_1
y_2 = \beta_0 + \beta_1 x_2 + \epsilon_2

so on and so forth for each point. We can write this using matrix notation in a far more efficient way.

3. Properties of a matrix (addition, multiplication, etc.)

Matrix multiplication: I could give you a definition along the lines of “multiply each row of the first matrix with each column of the second, adding up the products.” Here is my somewhat less convoluted example:

\begin{bmatrix}a & b\\c & d \end{bmatrix} \cdot \begin{bmatrix}e & f\\g & h\end{bmatrix} = \begin{bmatrix}ae+bg & af+bh\\ce+dg & cf+dh \end{bmatrix}

As for matrix addition, we simply add up the corresponding elements:

\begin{bmatrix}a & b\\c & d \end{bmatrix} + \begin{bmatrix}e & f\\g & h\end{bmatrix} = \begin{bmatrix}a+e & b+f\\c+g & d+h \end{bmatrix}

Next, the “transposition” of a matrix is simply a matrix with its rows and columns swapped. If an item is in row i and column j, it is now in row j and column i:

A = \begin{bmatrix} a & b & c\\ d & f & g \\ h & i & j \end{bmatrix}, \quad A^T = \begin{bmatrix} a & d & h\\ b & f & i \\ c & g & j \end{bmatrix}

It is important to note that (A^T)^T = A.

Last, the identity matrix. It’s like the number 1: AI = A for every A, where I has 1s on the diagonal and 0s everywhere else.

The inverse matrix A^{-1} is the matrix you multiply A by to get the corresponding identity matrix I, so that AA^{-1} = A^{-1}A = I. (Not every matrix has an inverse, but when one exists it lets us “undo” a multiplication by A.)
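All of these properties are easy to check numerically. A minimal sketch using numpy (the matrices here are made-up examples, not data from the article):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# Matrix multiplication: each entry is a row of A dotted with a column of B.
print(A @ B)    # [[19. 22.], [43. 50.]]

# Matrix addition is element-wise.
print(A + B)    # [[ 6.  8.], [10. 12.]]

# Transposing twice gives back the original: (A^T)^T = A.
print(np.array_equal(A.T.T, A))               # True

# The identity matrix acts like the number 1: A I = A.
I = np.eye(2)
print(np.array_equal(A @ I, A))               # True

# The inverse satisfies A A^{-1} = I (when it exists).
print(np.allclose(A @ np.linalg.inv(A), I))   # True
```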

4. Rewriting our collection of equations as matrices

Okay, now that our basics are covered, let’s go ahead and try and find some formulas! We can notice that:

\begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}\begin{bmatrix}\beta_0 \\ \beta_1\end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 x_1 \\ \beta_0 + \beta_1 x_2 \\ \vdots \\ \beta_0 + \beta_1 x_n \end{bmatrix}

which is what we have in our original set of functions! (Try multiplying it out yourself for small values of n if you are not easily convinced.)

So, we set:

X = \begin{bmatrix}1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \beta = \begin{bmatrix}\beta_0 \\ \beta_1\end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}

And now, we can write our set of functions as:

Y = X\beta + \epsilon

We have a wonderfully short and simple statement that we can then manipulate!
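As a quick sanity check, we can build X from a handful of points and confirm that the single matrix product reproduces all n equations at once. A small sketch in numpy; the data, coefficients, and error terms below are invented for illustration:

```python
import numpy as np

# Hypothetical points that lie near the line y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0])
eps = np.array([0.1, -0.2, 0.05, 0.0])   # error terms epsilon_i
beta = np.array([2.0, 3.0])              # [beta_0, beta_1]

# Design matrix: a column of ones, then a column of x-values.
X = np.column_stack([np.ones_like(x), x])

# Y = X beta + eps packs all n equations y_i = beta_0 + beta_1 x_i + eps_i.
Y = X @ beta + eps
print(Y)   # [2.1, 4.8, 8.05, 11.0]
```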

5. Finding a formula for the error given the coefficients

We now write a function for the error of prediction. Since we want to find the value of \beta that gives the least error, we solve our equation for \epsilon and treat it as a function of \beta:

e(\beta) = Y - X\beta

In class, we learned the function for mean squared error as:

\text{MSE}(\beta) = \frac{1}{n} \sum_{i=1}^n e_i^2(\beta)

In terms of our matrices, multiplying the transpose of a column vector by the vector itself gives \sum_{i=1}^n e_i^2, so we can write:

\text{MSE}(\beta) = \frac{1}{n} e^T e

How wonderfully convenient! Writing in terms of X and Y using e(\beta) = Y - X\beta:

\frac{1}{n}(Y - X\beta)^T(Y - X\beta) = \frac{1}{n}(Y^T - \beta^T X^T)(Y - X\beta)

Expanding:

\frac{1}{n}(Y^T Y - Y^T X\beta - \beta^T X^T Y + \beta^T X^T X\beta)

Since Y^T X\beta is a 1×1 matrix, it equals its own transpose: Y^T X\beta = \beta^T X^T Y. So:

\text{MSE}(\beta) = \frac{1}{n}(Y^T Y - 2\beta^T X^T Y + \beta^T X^T X\beta)
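We can also verify numerically that all three forms of the MSE agree. A small numpy sketch with invented data and an arbitrary (not optimal) candidate beta:

```python
import numpy as np

# Hypothetical data and a candidate beta (not necessarily optimal).
x = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([2.1, 4.8, 8.05, 11.0])
X = np.column_stack([np.ones_like(x), x])
beta = np.array([2.0, 3.0])
n = len(Y)

e = Y - X @ beta                # residual vector e(beta)

# MSE as a plain sum of squared errors...
mse_sum = np.sum(e**2) / n
# ...as the matrix product (1/n) e^T e...
mse_mat = (e @ e) / n
# ...and in the expanded form (1/n)(Y^T Y - 2 beta^T X^T Y + beta^T X^T X beta).
mse_exp = (Y @ Y - 2 * beta @ X.T @ Y + beta @ X.T @ X @ beta) / n

print(np.isclose(mse_sum, mse_mat) and np.isclose(mse_sum, mse_exp))  # True
```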

6. Finding the gradient

Since we want to find the values of \beta that minimize the mean squared error, we calculate the “slope” of the function and find the point at which the slope is 0. We take the “gradient” with respect to \beta.

More formally, the gradient of the function f with respect to x is:

\nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}

So, let’s find \nabla \text{MSE}(\beta). We distribute the gradient:

\nabla \text{MSE}(\beta) = \frac{1}{n}(\nabla Y^T Y - \nabla 2\beta^T X^T Y + \nabla \beta^T X^T X\beta)

First piece

Since Y^T Y is not impacted by \beta, it is effectively a constant and its gradient is 0.

Second piece

We can rewrite \nabla 2\beta^T X^T Y as \nabla \beta^T c where c = 2X^T Y is constant. Using sum notation:

\nabla_{\beta} \beta^T c = \nabla_{\beta} \sum_{i=1}^{n} \beta_i c_i

When differentiating with respect to \beta_i, every term in the sum is constant except \beta_i c_i, and \frac{\partial}{\partial \beta_i} \beta_i c_i = c_i, so:

\nabla_{\beta} (\beta^T c) = c

Therefore, \nabla 2\beta^T X^T Y = 2X^T Y.

Third piece

We let A = X^T X and work with f(\beta) = \beta^T A \beta = \sum_{i,j} \beta_i A_{ij} \beta_j. Taking the partial derivative with respect to \beta_k picks out the terms containing \beta_k, giving \sum_j A_{kj}\beta_j + \sum_i \beta_i A_{ik}. Since A is symmetric (A = X^T X implies A_{ij} = A_{ji}), the two sums are equal, and stacking the partials gives:

\nabla \beta^T X^T X\beta = 2X^T X\beta

7. Finale

Putting it all together:

\nabla \text{MSE}(\beta) = \frac{1}{n}(0 - 2X^T Y + 2X^T X\beta) = \frac{2}{n}(X^T X\beta - X^T Y)
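Before trusting this formula, we can check it against a numerical gradient computed with central finite differences. A sketch in numpy, with invented data and an arbitrary beta:

```python
import numpy as np

# Hypothetical data.
x = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([2.1, 4.8, 8.05, 11.0])
X = np.column_stack([np.ones_like(x), x])
n = len(Y)

def mse(beta):
    e = Y - X @ beta
    return (e @ e) / n

beta = np.array([1.0, 1.0])   # arbitrary test point

# Closed form: (2/n)(X^T X beta - X^T Y)
grad = (2.0 / n) * (X.T @ X @ beta - X.T @ Y)

# Numerical gradient via central differences, one coordinate at a time.
h = 1e-6
num = np.array([
    (mse(beta + h * np.eye(2)[k]) - mse(beta - h * np.eye(2)[k])) / (2 * h)
    for k in range(2)
])

print(np.allclose(grad, num, atol=1e-4))  # True
```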

At the optimum \hat{\beta}, this equals zero:

X^T X\hat{\beta} - X^T Y = 0 \implies \hat{\beta} = (X^T X)^{-1} X^T Y

(We cannot divide by a matrix; instead, we multiply both sides on the left by the inverse (X^T X)^{-1}.)

Given sets of points formatted into a matrix as we originally established, we can find good estimates for our coefficients!
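A minimal sketch of the final formula with invented data; note that in practice it is preferable to solve the linear system X^T X \hat{\beta} = X^T Y rather than forming the inverse explicitly:

```python
import numpy as np

# Hypothetical noisy points near the line y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([2.1, 4.8, 8.05, 11.0])
X = np.column_stack([np.ones_like(x), x])

# Normal equations: X^T X beta_hat = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # roughly [2, 3], matching the line the data came from

# Cross-check against numpy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```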

8. Multivariable and nonlinear extensions

The multivariable and nonlinear extensions use the same formula. For a quadratic model:

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i

We add another column for x^2 and another row for \beta_2:

X = \begin{bmatrix}1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix}, \quad \beta = \begin{bmatrix}\beta_0 \\ \beta_1 \\ \beta_2\end{bmatrix}

This X matrix is called the design matrix. The order of the columns is arbitrary; you just need to make sure that the columns of X correspond with the rows of \beta.
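The same formula fits the quadratic model unchanged. A short sketch where the points are generated exactly from an invented quadratic, so the recovered coefficients should match it:

```python
import numpy as np

# Hypothetical points generated from y = 1 + 2x + 0.5x^2 (no noise).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
Y = 1.0 + 2.0 * x + 0.5 * x**2

# Design matrix with an extra column for x^2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Same normal-equations formula, now recovering three coefficients.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # close to [1, 2, 0.5]
```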

We can then go on to talk about properties of the hat matrix, fitted values, residuals, covariances, Gaussian noise, expectations, etc. — but that requires a stronger linear algebra prerequisite and will be saved for another article.

