# Bias-Variance Decomposition

In this post, we decompose the error of an estimator into bias and variance. We use the MSE (mean squared error) as our error measure. Let $\theta$ be the parameter we want to estimate and $\hat{\theta}$ be our estimator. Then $MSE(\hat{\theta})=E[(\hat{\theta}-\theta)^2]$ and $bias(\hat{\theta})=E[\hat{\theta}-\theta]=E[\hat{\theta}]-\theta$. Write $\mu=E[\hat{\theta}]$ for the expected value of the estimator. Let’s decompose the MSE by subtracting and adding $\mu$ inside the square.

$$
\begin{aligned}
MSE &= E[(\hat{\theta}-\theta)^2]=E[((\hat{\theta}-\mu)-(\theta-\mu))^2]\\
&=E[(\hat{\theta}-\mu)^2-2(\hat{\theta}-\mu)(\theta-\mu)+(\theta-\mu)^2]\\
&=E[(\hat{\theta}-\mu)^2]-E[2(\hat{\theta}-\mu)(\theta-\mu)]+E[(\theta-\mu)^2]\\
&=E[(\hat{\theta}-\mu)^2]-2(\theta-\mu)E[\hat{\theta}-\mu]+(\theta-\mu)^2 \quad \ldots (i)\\
&=E[(\hat{\theta}-\mu)^2]-2(\theta-\mu)(E[\hat{\theta}]-E[\mu])+(\theta-\mu)^2\\
&=E[(\hat{\theta}-\mu)^2]-2(\theta-\mu)(\mu-\mu)+(\theta-\mu)^2\\
&=E[(\hat{\theta}-\mu)^2]-2(\theta-\mu)(0)+(\theta-\mu)^2\\
&=E[(\hat{\theta}-\mu)^2]+(\theta-\mu)^2\\
&=E[(\hat{\theta}-E[\hat{\theta}])^2]+(\mu-\theta)^2 \quad \ldots (ii)\\
&=E[(\hat{\theta}-E[\hat{\theta}])^2]+(E[\hat{\theta}]-\theta)^2\\
&=variance(\hat{\theta})+bias(\hat{\theta})^2
\end{aligned}
$$

Step $(i)$ holds because $\theta$ and $\mu$ are constants, so they can be pulled out of the expectation. In step $(ii)$, we can replace $\theta-\mu$ with $\mu-\theta$ because the term is squared. The last line shows that the MSE is the sum of the variance and the squared bias. Consequently, for a fixed MSE, lowering the bias means raising the variance, and vice versa.
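
The identity can also be checked numerically. The following is a minimal sketch, not part of the derivation itself: it assumes normally distributed data and a deliberately biased estimator (the sample mean shrunk by a factor of 0.8). The helper name `decompose_mse` and all constants are hypothetical choices made for illustration.

```python
import numpy as np

def decompose_mse(estimates, theta):
    """Return (mse, variance, bias) of repeated estimates of theta."""
    estimates = np.asarray(estimates, dtype=float)
    mse = np.mean((estimates - theta) ** 2)
    variance = np.var(estimates)        # population variance (ddof=0)
    bias = np.mean(estimates) - theta   # E[theta_hat] - theta
    return mse, variance, bias

rng = np.random.default_rng(0)
theta = 2.0                     # true parameter (assumed for the demo)
n_trials, n_samples = 10_000, 20

# Each row is one simulated dataset drawn around the true mean.
samples = rng.normal(loc=theta, scale=1.0, size=(n_trials, n_samples))

# Shrunk sample mean: biased on purpose so both terms are non-zero.
estimates = 0.8 * samples.mean(axis=1)

mse, variance, bias = decompose_mse(estimates, theta)
print(f"MSE               = {mse:.6f}")
print(f"variance + bias^2 = {variance + bias ** 2:.6f}")
```

The two printed values should agree up to floating-point error, since the decomposition also holds exactly for empirical averages when the variance is computed with `ddof=0`.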

## Bias-variance decomposition relation for machine learning

From the decomposition of the MSE into bias and variance above, we know that, for a given error, we cannot make both small at the same time: if we make one smaller, the other becomes bigger. The same holds in machine learning. When we train our model to get lower bias, we typically get higher variance. Here are the characteristics.

• High bias = low variance $\rightarrow$ our model is simpler (less flexible), more stable, and more general. This is the case when we set a larger regularization constant. Such a model is more prone to underfitting.
• High variance = low bias $\rightarrow$ our model is more sensitive to the training data. This is the case when we set a smaller regularization constant. Such a model is more prone to overfitting.
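
The effect of the regularization constant can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a general recipe: it uses ridge regression in closed form, a fixed design matrix, a linear ground truth, and Gaussian noise. The names `ridge_fit`, `bias_variance_at_point`, `w_true`, and `x0`, as well as the two regularization values, are hypothetical and chosen only for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def bias_variance_at_point(lam, X, w_true, x0, noise_std, n_trials, seed=0):
    """Estimate squared bias and variance of the ridge prediction at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Resample only the noise; the design matrix X stays fixed.
        y = X @ w_true + rng.normal(scale=noise_std, size=X.shape[0])
        preds[t] = x0 @ ridge_fit(X, y, lam)
    bias_sq = (preds.mean() - x0 @ w_true) ** 2
    variance = preds.var()
    return bias_sq, variance

rng = np.random.default_rng(42)
X = rng.normal(size=(30, 3))            # fixed training inputs
w_true = np.array([1.5, -2.0, 0.5])     # assumed true weights
x0 = np.array([1.0, -1.0, 0.5])         # point at which we predict

results = {}
for lam in (0.01, 100.0):
    results[lam] = bias_variance_at_point(
        lam, X, w_true, x0, noise_std=1.0, n_trials=2000
    )
    bias_sq, variance = results[lam]
    print(f"lambda={lam:>6}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup, the larger regularization constant should give a larger squared bias and a smaller variance than the smaller one, which matches the two cases listed above.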