Deriving Polynomial Regression with Regularization to Avoid Overfitting

After we discuss about polynomial regression here using LSE (Least Square Error), we know that higher order of polynomial model has more capability to fit more complex data points, but more prone to be overfitting.  Picture below illustrates that red line (using high order) exactly fit those blue dot points, but will give big error, such as in axis 0.9. That is what we called overfitting (away too fit data training). In this case, green line is better, that has more general model to represent those data points.

We can avoid overfitting by using so-called regularization. How does it work? Usually, a function is prone to be overfitting when its coefficients (weighting values) has big value and not well distributed. Thus, we will force our training process to make those coefficients small by adding a term in our cost function. This process also makes those coefficients more well distributed. Here is our new cost function.

$J(\mathbf{a})=\frac{1}{2m}[\sum_{i=1}^{m}(h(x_i)-y_i)^2)+\lambda\sum_{j=1}^{n}a_j^2]$,

with our hypothesis funtion $h(\mathbf{x})=a_0 x^0+a_1 x^1+a_2 x^2+....+a_n x^n$

There are two terms in equation above. Since we will minimize $J(\mathbf{a})$, thus, the first term of equation above is to make our square error as small as possible, and the second term is to make our coefficients also small. This second term is what we called regularization. As for constant $\lambda$, it is to determine how big we want to regularize. If we set  $\lambda$  with a value $> 1$, we make our regularization is more important than our first term minimizing square error. We can try some values of $\lambda$, and choose a value that gives our hypothesis/prediction function best performance.

We will do some linear algebra to simplify our cost function. For first term, we already derive here giving us $\sum_{i=1}^{m}(h(x_i)-y_i)^2 =(\mathbf{Xa})^T\mathbf{Xa}-2\mathbf{Xa}^T\mathbf{y}+\mathbf{y}^T\mathbf{y}$. And the second term, we can simplify as follows.

$\lambda\sum_{j=1}^{n}a_j^2=\lambda na^2$,   for one pair input-output prediction

Re-writing for $m$ pair input-output prediction, we can get.

$(\lambda n \mathbf{I} \mathbf{a})^T\mathbf{a}$,  here $\mathbf{I}$ is identity matrix

Thus, our cost function $J(\mathbf{a})$ becomes:

$J(\mathbf{a})=\frac{1}{2m}[(\mathbf{Xa})^T\mathbf{Xa}-2\mathbf{Xa}^T\mathbf{y}+\mathbf{y}^T\mathbf{y}+(\lambda n \mathbf{I} \mathbf{a})^T\mathbf{a}]$

Again, our purpose is to minimize our cost function. In this case, we will find parameter $\mathbf{a}$ that gives us the minimal value of our cost function, $J(\mathbf{a})$. Similar with what we already do here and here, we will take first differential and make it equal to zero. Here we go.

$J(\mathbf{a})=\frac{1}{2m}[(\mathbf{Xa})^T\mathbf{Xa}-2(\mathbf{Xa})^T\mathbf{y}+\mathbf{y}^T\mathbf{y}+\lambda n \mathbf{I} \mathbf{a}^T\mathbf{a}] \\\\ \frac{dJ(\mathbf{a})}{d\mathbf{a}}=\frac{1}{2m}[2\mathbf{X}^T\mathbf{Xa}-2\mathbf{X}^T\mathbf{y}+2\lambda n \mathbf{I}\mathbf{a}] = 0\\\\ 2\mathbf{X}^T\mathbf{Xa}-2\mathbf{X}^T\mathbf{y}+2\lambda n \mathbf{I} \mathbf{a} = 0\\\\ (\mathbf{X}^T\mathbf{X}+\lambda n \mathbf{I})\mathbf{a}-\mathbf{X}^T\mathbf{y} = 0\\\\ (\mathbf{X}^T\mathbf{X}+\lambda n \mathbf{I})\mathbf{a}=\mathbf{X}^T\mathbf{y}\\\\ \mathbf{a}=(\mathbf{X}^T\mathbf{X}+\lambda n \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$

Because $\lambda n$ are constants, we can denote it using new single constant, $\lambda_{new}$. And our $\mathbf{a}$ becomes.

$\mathbf{a}=(\mathbf{X}^T\mathbf{X}+\lambda_{new}\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$

Hooray! We already get our parameter $\mathbf{a}$ for our polynomial regression model with regularization. Rewriting our hypothesis model, here is our final model for polynomial regression with regularization.

$h(\mathbf{x})=a_0 x^0+a_1 x^1+a_2 x^2+....+a_n x^n$,

with $\mathbf{a}=(\mathbf{X}^T\mathbf{X}+\lambda_{new}\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, where design matrix $\mathbf{X}=\begin{bmatrix} 1\,\,x_1^1\,\,x_1^2\,...\,x_1^n\\ 1\,\,x_2^1\,\,x_2^2\,...\,x_2^n\\ 1\,\,x_3^1\,\,x_3^2\,...\,x_3^n\\ ...\,\,...\,\,...\,\,...\,\,...\\ 1\,x_m^1\,x_m^2\,...\,x_m^n\end{bmatrix}$