
Expectation Rules
Let’s first review the rules for expectation, variance, and covariance so I won’t have to re-derive them throughout my notes:
Best prediction given random Y
We can write the error for a prediction m as (Y−m)², where Y is a real-valued random variable and m is our prediction. The mean squared error (MSE) is the expected value of this: MSE(m)=E[(Y−m)²].
We can also introduce the following identities for any random variable Z:
Var(Z)=E[(Z−E[Z])²]
and:
E[Z²]=(E[Z])²+Var(Z)
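These identities are easy to sanity-check numerically. Here’s a quick Python sketch (the sample values are arbitrary, just for illustration) that treats a small finite sample’s averages as exact expectations:

```python
# Check the identity E[Z^2] = (E[Z])^2 + Var(Z) on a small sample,
# where "expectation" is just the exact average over the sample.
z = [1.0, 2.0, 2.0, 3.0, 7.0]
n = len(z)

mean_z = sum(z) / n                              # E[Z]
mean_z2 = sum(v * v for v in z) / n              # E[Z^2]
var_z = sum((v - mean_z) ** 2 for v in z) / n    # Var(Z), population form

lhs = mean_z2
rhs = mean_z ** 2 + var_z                        # (E[Z])^2 + Var(Z)
```

Note the variance here uses the population convention (divide by n), matching the definition Var(Z)=E[(Z−E[Z])²] above.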
Combining these two formulas gives us something called the bias-variance decomposition, which we can apply to our expression for the MSE:
E[(Y−m)²]=(E[Y−m])²+Var(Y−m)
We can also rewrite: E[Y−m]=E[Y]−m, because subtracting a constant doesn’t change the expected value!
Also: Var(Y−m)=Var(Y), because subtracting a constant does not change variance.
So, we can rewrite the MSE as:
MSE(m)=(E[Y]−m)²+Var(Y)
Since Var(Y) doesn’t depend on m, minimizing the MSE means minimizing the (E[Y]−m)² term. We can turn to calculus: take the derivative with respect to m and set it equal to zero to find the minimum of our function.
We can use the chain rule:
d/dm (E[Y]−m)² = −2(E[Y]−m)
Now we set it equal to zero:
−2(E[Y]−m)=0⟹E[Y]=m
This means that the best single number prediction of a random variable under squared error loss is just the mean! (This is decently intuitive, but it’s cool to go out and derive it yourself).
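We can check this conclusion numerically. Here’s a sketch that scans candidate predictions m over a grid (the sample y and the grid range are arbitrary choices of mine) and confirms the best one is the sample mean:

```python
# For a finite sample, MSE(m) = average of (y - m)^2.
# Theory says the minimizer is m = mean(y); scan a grid to confirm.
y = [2.0, 3.0, 5.0, 10.0]
mean_y = sum(y) / len(y)

def mse(m):
    return sum((v - m) ** 2 for v in y) / len(y)

# Candidate predictions 0.00, 0.01, ..., 15.00
candidates = [i / 100 for i in range(0, 1501)]
best = min(candidates, key=mse)
```

Because the MSE is a strictly convex quadratic in m, the grid’s winner sits exactly at the sample mean whenever the mean lies on the grid.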
Predicting one random variable from another
Let’s say we observe X and want to predict Y. If X=x, we predict m(x).
The law of total expectation states that:
E[(Y−m(X))²]=E[E[(Y−m(X))²∣X]]
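To see this identity in action, here’s a tiny discrete sketch (the joint distribution and the predictor m are made up for illustration): both sides are computed directly and agree.

```python
# Tower-property check: E[(Y - m(X))^2] == E[ E[(Y - m(X))^2 | X] ]
# on a small discrete joint distribution over (x, y) pairs.
joint = {(0, 1.0): 0.2, (0, 3.0): 0.3, (1, 2.0): 0.1, (1, 5.0): 0.4}
m = {0: 2.0, 1: 4.0}  # an arbitrary predictor m(x)

# Left side: expectation taken directly over the joint distribution.
lhs = sum(p * (y - m[x]) ** 2 for (x, y), p in joint.items())

# Right side: inner conditional expectation given X, then outer over X.
px = {}
for (x, _), p in joint.items():
    px[x] = px.get(x, 0.0) + p
inner = {
    x: sum(p * (y - m[x]) ** 2 for (xx, y), p in joint.items() if xx == x) / px[x]
    for x in px
}
rhs = sum(px[x] * inner[x] for x in px)
```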
Now we restrict m(X)=β0+β1X so that we can find the optimal linear predictor (just for now)! So:
MSE(β0,β1)=E[(Y−(β0+β1X))²]
We can multiply this out and distribute the expectation:
E[Y²−2Y(β0+β1X)+(β0+β1X)²]
=E[Y²]−2β0E[Y]−2β1E[XY]+E[(β0+β1X)²]
We can deal with the last term:
E[(β0+β1X)²]=E[β0²+2β0β1X+β1²X²]
We then distribute the expectation, noting that E[β0²]=β0², E[2β0β1X]=2β0β1E[X], and E[β1²X²]=β1²E[X²].
Plugging that into our original function gives us:
MSE(β0,β1)=E[Y²]−2β0E[Y]−2β1E[XY]+β0²+2β0β1E[X]+β1²E[X²]
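As a sanity check that no terms were lost in the expansion, here’s a sketch comparing the expanded expression to the direct definition on a small paired sample (the data and the coefficients are arbitrary):

```python
# Compare E[(Y - (b0 + b1 X))^2] computed directly against the
# fully expanded MSE formula, on a small paired sample where
# expectations are exact averages.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]
n = len(x)
b0, b1 = 0.5, 1.8  # arbitrary coefficients to test at

E = lambda vals: sum(vals) / n  # expectation = sample average

direct = E([(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)])
expanded = (
    E([yi ** 2 for yi in y])
    - 2 * b0 * E(y)
    - 2 * b1 * E([xi * yi for xi, yi in zip(x, y)])
    + b0 ** 2
    + 2 * b0 * b1 * E(x)
    + b1 ** 2 * E([xi ** 2 for xi in x])
)
```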
Let’s find the values of β0 and β1 that minimize this function. We can use the same derivative trick from earlier, except this time we need partial derivatives since we’re dealing with two variables, β0 and β1. Starting with β0, we drop every term that doesn’t contain β0:
∂MSE/∂β0 = ∂/∂β0 (β0²−2β0E[Y]+2β0β1E[X])
I’m going to skip over some routine calculus steps here. This is equal to:
2β0−2E[Y]+2β1E[X]
Setting this equal to 0, we can solve and get β0=E[Y]−β1E[X]
We can do the same thing for β1:
∂MSE/∂β1 = ∂/∂β1 (−2β1E[XY]+2β0β1E[X]+β1²E[X²])
Which is equal to:
−2E[XY]+2β0E[X]+2β1E[X²]
Now we set it equal to 0, and with some careful algebra we get E[XY]=β0E[X]+β1E[X²]. Substituting β0=E[Y]−β1E[X] gives E[XY]=E[X]E[Y]−β1(E[X])²+β1E[X²]
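The two first-order conditions together form a pair of linear (“normal”) equations in β0 and β1. Here’s a sketch that solves them on a small made-up sample (finishing the substitution numerically) and checks that the solution satisfies both conditions and beats nearby perturbations:

```python
# Solve the two first-order conditions from the derivation:
#   b0        + b1 * E[X]   = E[Y]      (from dMSE/db0 = 0)
#   b0 * E[X] + b1 * E[X^2] = E[XY]     (from dMSE/db1 = 0)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)

Ex = sum(x) / n
Ey = sum(y) / n
Exx = sum(xi * xi for xi in x) / n
Exy = sum(xi * yi for xi, yi in zip(x, y)) / n

# Substituting the first equation into the second and solving for b1
# gives the slope in terms of the moments above.
b1 = (Exy - Ex * Ey) / (Exx - Ex ** 2)
b0 = Ey - b1 * Ex

def mse(a, b):
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / n
```

Since the MSE is a convex quadratic in (β0, β1), any point satisfying both conditions is the global minimum, which the perturbation check below confirms.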
…to be continued!