Bias-Variance Decomposition

In this post we will decompose an error into its bias and variance components. We will use MSE (mean squared error) as our error measure. Let \theta be the parameter we want to estimate, \hat{\theta} our estimator, and \mu = E[\hat{\theta}] the mean of the estimator. Then MSE(\hat{\theta}) = E[(\hat{\theta}-\theta)^2] and bias(\hat{\theta}) = E[\hat{\theta}-\theta] = E[\hat{\theta}]-\theta. Let's decompose the MSE.

\begin{aligned}
MSE &= E[(\hat{\theta}-\theta)^2] = E[((\hat{\theta}-\mu)-(\theta-\mu))^2]\\
&= E[(\hat{\theta}-\mu)^2 - 2(\hat{\theta}-\mu)(\theta-\mu) + (\theta-\mu)^2]\\
&= E[(\hat{\theta}-\mu)^2] - E[2(\hat{\theta}-\mu)(\theta-\mu)] + E[(\theta-\mu)^2]\\
&= E[(\hat{\theta}-\mu)^2] - 2(\theta-\mu)E[\hat{\theta}-\mu] + (\theta-\mu)^2 \quad (i)\\
&= E[(\hat{\theta}-\mu)^2] - 2(\theta-\mu)(E[\hat{\theta}]-E[\mu]) + (\theta-\mu)^2\\
&= E[(\hat{\theta}-\mu)^2] - 2(\theta-\mu)(\mu-\mu) + (\theta-\mu)^2\\
&= E[(\hat{\theta}-\mu)^2] - 2(\theta-\mu)(0) + (\theta-\mu)^2\\
&= E[(\hat{\theta}-\mu)^2] + (\theta-\mu)^2\\
&= E[(\hat{\theta}-E[\hat{\theta}])^2] + (\mu-\theta)^2 \quad (ii)\\
&= E[(\hat{\theta}-E[\hat{\theta}])^2] + (E[\hat{\theta}]-\theta)^2\\
&= \mathrm{variance}(\hat{\theta}) + \mathrm{bias}(\hat{\theta})^2
\end{aligned}

We can do step (i) because \theta and \mu are constants, so they can be pulled out of the expectation. For step (ii), we can flip \theta-\mu into \mu-\theta since the term is squared. The last line shows that the MSE decomposes into the variance of the estimator plus the squared bias. Since the two terms add up to the MSE, for a fixed error budget making the bias smaller means accepting a larger variance, and vice versa.
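We can also check this decomposition numerically. Below is a minimal sketch; the setup (a normal population and a deliberately shrunken sample mean as the biased estimator) is an assumption chosen purely for illustration, and the numbers are simulated estimates, not exact values.

```python
import numpy as np

# Assumed setup: estimate the mean theta of a normal population with a
# shrunken sample mean, theta_hat = 0.9 * mean(sample), and check
# empirically that MSE ≈ variance + bias^2.
rng = np.random.default_rng(0)
theta = 2.0                 # true parameter
n, trials = 30, 200_000     # sample size and number of repeated experiments

samples = rng.normal(loc=theta, scale=1.0, size=(trials, n))
theta_hat = 0.9 * samples.mean(axis=1)   # deliberately biased estimator

mse = np.mean((theta_hat - theta) ** 2)
bias = theta_hat.mean() - theta
variance = theta_hat.var()

print(f"MSE               = {mse:.5f}")
print(f"variance + bias^2 = {variance + bias**2:.5f}")  # should match MSE
```

The two printed values agree up to simulation noise, which is exactly what the algebra above says they must do.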

The bias-variance trade-off in machine learning

From the decomposition of the MSE into bias and variance above, we know that we cannot make both small at the same time: if we make one smaller, the other becomes bigger. The same trade-off appears in machine learning. When we train our model to get lower bias, we will have higher variance. Here are the characteristics.

  • High bias = low variance \rightarrow our model is simpler, more stable, and more general. This is the case when we set a bigger regularization constant. This regime is more prone to underfitting.
  • High variance = low bias \rightarrow our model is more flexible and more sensitive to the training data. This is the case when we set a smaller regularization constant. This regime is more prone to overfitting (see the sketch after this list).
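
To make the regularization effect concrete, here is a minimal sketch. The dataset (a noisy sine curve), the degree-10 polynomial features, and the two alpha values are assumptions chosen for illustration, not part of the derivation above: a large alpha gives an underfit (high-bias) model, a tiny alpha gives an overfit (high-variance) one.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed setup: noisy sine data, polynomial ridge regression,
# one heavily regularized fit and one barely regularized fit.
rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 30)[:, None]
y_train = np.sin(2 * np.pi * x_train).ravel() + rng.normal(0, 0.3, 30)
x_test = np.linspace(0, 1, 200)[:, None]
y_test = np.sin(2 * np.pi * x_test).ravel()

for alpha in (10.0, 1e-9):  # large alpha -> high bias, tiny alpha -> high variance
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"alpha={alpha:g}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

With the large alpha both errors are similar but high (underfitting); with the tiny alpha the training error is very low while the test error grows (overfitting), which is the variance term showing up.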