notes on regression


Expectation Rules

Let’s first review some rules for expectation, variance, and covariance so I don’t have to re-derive them throughout these notes:

Best prediction given random Y

We can write the error for a given prediction as $(Y-m)^2$, where $Y$ is a random, real value and $m$ is our prediction. The MSE, or mean squared error, is the expected value of this: $\mathbb{E}[(Y-m)^2]$.

We can also introduce the following formulas for any random variable $Z$:

$\text{Var}(Z) = \mathbb{E}[(Z - \mathbb{E}[Z])^2]$

and:

$\mathbb{E}[Z^2] = (\mathbb{E}[Z])^2 + \text{Var}(Z)$

Combining these two formulas gives us something called the bias-variance decomposition, which we can apply to our formula for MSE:

$\mathbb{E}[(Y-m)^2] = (\mathbb{E}[Y-m])^2 + \text{Var}(Y-m)$

We can also rewrite $\mathbb{E}[Y-m] = \mathbb{E}[Y] - m$, by linearity of expectation: subtracting a constant just shifts the mean by that constant!

Also, $\text{Var}(Y-m) = \text{Var}(Y)$, because subtracting a constant does not change variance.

So, we can rewrite the MSE as:

$\text{MSE}(m) = (\mathbb{E}[Y]-m)^2 + \text{Var}(Y)$
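This decomposition is easy to check numerically. Here is a minimal sketch with NumPy (the distribution and the prediction $m = 1.5$ are arbitrary choices, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=3.0, size=100_000)  # a sample standing in for Y
m = 1.5                                           # an arbitrary prediction

# Direct MSE vs. the bias-variance decomposition
mse = np.mean((y - m) ** 2)
decomposed = (np.mean(y) - m) ** 2 + np.var(y)    # squared bias + variance

assert np.isclose(mse, decomposed)
```

On a finite sample this is actually an exact algebraic identity (with the population-style variance, `ddof=0`), not just an approximation.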

Since $\text{Var}(Y)$ doesn’t depend on our prediction $m$, we’re trying to minimize the $(\mathbb{E}[Y]-m)^2$ term. We can turn to calculus: take the derivative and set it equal to zero to find the minimum of our function.

We can use the chain rule:

$\frac{\mathrm{d}}{\mathrm{d}m} (\mathbb{E}[Y]-m)^2 = -2(\mathbb{E}[Y]-m)$

Now we set it equal to zero:

$-2(\mathbb{E}[Y]-m) = 0 \implies m = \mathbb{E}[Y]$

This means that the best single-number prediction of a random variable under squared error loss is just its mean! (This is decently intuitive, but it’s cool to derive it yourself.)
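We can see this empirically too: sweep candidate predictions over a grid and watch the empirical MSE bottom out at the sample mean. A quick sketch (the exponential distribution and grid range are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=50_000)  # any distribution works here

# Evaluate the empirical MSE for a grid of candidate predictions m
candidates = np.linspace(0, 6, 601)
mses = [np.mean((y - m) ** 2) for m in candidates]
best = candidates[np.argmin(mses)]

# The minimizer sits right at the sample mean (within grid resolution)
assert abs(best - y.mean()) < 0.01
```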

Predicting one random variable from another

Let’s say we observe $X$ and want to predict $Y$. If $X=x$, we predict $m(x)$.

The law of total expectation states that:

$\mathbb{E}[(Y-m(X))^2] = \mathbb{E}\big[\mathbb{E}[(Y-m(X))^2 \mid X]\big]$

Now we restrict $m(X) = \beta_0 + \beta_1 X$ so that we can find the optimal linear predictor (just for now)! So:

$\text{MSE}(\beta_0, \beta_1) = \mathbb{E}[(Y-(\beta_0 + \beta_1 X))^2]$

We can multiply this out and distribute the expectation:

$\mathbb{E}[Y^2 - 2Y(\beta_0 + \beta_1 X) + (\beta_0 + \beta_1 X)^2] = \mathbb{E}[Y^2] - 2\beta_0\mathbb{E}[Y] - 2\beta_1 \mathbb{E}[XY] + \mathbb{E}[(\beta_0 + \beta_1 X)^2]$

We can deal with the last term:

$\mathbb{E}[(\beta_0 + \beta_1 X)^2] = \mathbb{E}[\beta_0^2 + 2\beta_0\beta_1 X + \beta_1^2 X^2]$

We then distribute the expectation, noting that $\mathbb{E}[\beta_0^2] = \beta_0^2$, $\mathbb{E}[2\beta_0\beta_1 X] = 2\beta_0\beta_1\mathbb{E}[X]$, and $\mathbb{E}[\beta_1^2 X^2] = \beta_1^2 \mathbb{E}[X^2]$.

Plugging that into our original function gives us:

$\text{MSE}(\beta_0, \beta_1) = \mathbb{E}[Y^2] - 2\beta_0\mathbb{E}[Y] - 2\beta_1 \mathbb{E}[XY] + \beta_0^2 + 2\beta_0\beta_1 \mathbb{E}[X] + \beta_1^2 \mathbb{E}[X^2]$
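It’s worth double-checking that this expansion really does equal the original MSE. A quick numerical sanity check (the data-generating line and the coefficients $\beta_0 = 0.7$, $\beta_1 = 1.3$ are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)  # any joint distribution works
b0, b1 = 0.7, 1.3                      # arbitrary coefficients

# MSE computed directly vs. via the expanded sum of moments
direct = np.mean((y - (b0 + b1 * x)) ** 2)
expanded = (np.mean(y ** 2) - 2 * b0 * np.mean(y) - 2 * b1 * np.mean(x * y)
            + b0 ** 2 + 2 * b0 * b1 * np.mean(x) + b1 ** 2 * np.mean(x ** 2))

assert np.isclose(direct, expanded)
```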

Let’s find the values of $\beta_0$ and $\beta_1$ that minimize the function. We can use the same derivative trick from earlier, but this time we need partial derivatives since we’re dealing with two variables. Starting with $\beta_0$, we drop all terms that don’t involve $\beta_0$:

$\frac{\partial \text{MSE}}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \left(\beta_0^2 - 2\beta_0 \mathbb{E}[Y] + 2\beta_0\beta_1 \mathbb{E}[X]\right)$

I’m going to skip over some trivial calculus steps here; if they’re hard to follow, it’s worth brushing up on calculus before reading on. This is equal to:

$2\beta_0 - 2\mathbb{E}[Y] + 2\beta_1 \mathbb{E}[X]$

Setting this equal to $0$, we can then solve and get $\beta_0 = \mathbb{E}[Y] - \beta_1 \mathbb{E}[X]$.

We can do the same thing for $\beta_1$:

$\frac{\partial \text{MSE}}{\partial \beta_1} = \frac{\partial}{\partial \beta_1}\left(-2\beta_1 \mathbb{E}[XY] + 2\beta_0 \beta_1 \mathbb{E}[X] + \beta_1^2 \mathbb{E}[X^2]\right)$

Which is equal to:

$-2 \mathbb{E}[XY] + 2 \beta_0 \mathbb{E}[X] + 2\beta_1 \mathbb{E}[X^2]$

Now we set it equal to $0$, and with some careful algebra we get $\mathbb{E}[XY] = \beta_0 \mathbb{E}[X] + \beta_1 \mathbb{E}[X^2]$.
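Even before finishing the algebra, we can sanity-check the two first-order conditions numerically: together they form a $2 \times 2$ linear system in $\beta_0$ and $\beta_1$, and solving it on simulated data should reproduce an ordinary least-squares fit. A sketch, assuming a made-up linear relationship for the data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
y = 3.0 + 0.5 * x + rng.normal(size=100_000)  # made-up line plus noise

# The two first-order conditions as a linear system:
#   E[Y]  = b0        + b1 * E[X]
#   E[XY] = b0 * E[X] + b1 * E[X^2]
A = np.array([[1.0,       x.mean()],
              [x.mean(),  np.mean(x ** 2)]])
b = np.array([y.mean(), np.mean(x * y)])
b0, b1 = np.linalg.solve(A, b)

# Should agree with a standard least-squares fit
slope, intercept = np.polyfit(x, y, 1)
assert np.isclose(b1, slope) and np.isclose(b0, intercept)
```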

…to be continued!

