# Polynomial Regression Using Least Square Estimation

We already discuss about how to estimate linear regression function using least square estimation here. But, many real problem cannot be modeled using only linear model. See picture here.

We cannot fit a line (green line) for those data points (blue dot), otherwise, it will give us a big error. For this case, we need polynomial function to fit those data (red line). We can write our hypothesis/prediction function using polynomial model as follows.

$h(\mathbf{x})=a_0 x^0+a_1 x^1+a_2 x^2+....+a_n x^n$

Equation above is general form of our hypothesis function with polynomial order $= n$. We can represent linear regression by setting order $n= 1$, so that the hypothesis function become :

$h(\mathbf{x})=a_1 x^0+a_2 x=a_1 +a_2 x$

Actually, the rest process is very similar with what we already discuss in linear regression here. The difference is only in the “design matrix” of $\mathbf{x}$. But, it’s OK. I will just discuss again the detail here. Back to our case, we can write our $h(\mathbf{x})$ in matrix notation.

$h(\mathbf{x})=\mathbf{a}^{T}\mathbf{x}$ , with $\mathbf{a}=\begin{bmatrix} a_0\\a_1\\a_2\\...\\a_n\end{bmatrix}$  and  $\mathbf{x}=\begin{bmatrix} x_1\\x_2\\x_3\\...\\x_n\end{bmatrix}$

We can write in another form for $m$ pair input-output prediction in matrix form as follows.

$\begin{bmatrix} h(x_1)\\h(x_2)\\h(x_3)\\...\\h(x_m)\end{bmatrix}= \begin{bmatrix} 1\,\,x_1^1\,\,x_1^2\,...\,x_1^n\\ 1\,\,x_2^1\,\,x_2^2\,...\,x_2^n\\ 1\,\,x_3^1\,\,x_3^2\,...\,x_3^n\\ ...\,\,...\,\,...\,\,...\,\,...\\ 1\,x_m^1\,x_m^2\,...\,x_m^n\end{bmatrix}\begin{bmatrix} a_0\\a_1)\\a_2\\...\\a_n\end{bmatrix}$

$h(\mathbf{x})=\mathbf{Xa}$

In this case, our design matrix is $\begin{bmatrix} 1\,\,x_1^1\,\,x_1^2\,...\,x_1^n\\ 1\,\,x_2^1\,\,x_2^2\,...\,x_2^n\\ 1\,\,x_3^1\,\,x_3^2\,...\,x_3^n\\ ...\,\,...\,\,...\,\,...\,\,...\\ 1\,x_m^1\,x_m^2\,...\,x_m^n\end{bmatrix}$, and we denote it using bold-uppercase of x, $\mathbf{X}$.

Then, we will make cost function to define our error function. We will use average of square error as our basic form. And our purpose is to minimize the error “as small as we can” (see trade-off topic in the bottom of this post). That’s why we name this with “least square estimation. We can write our cost function as follows.

$J(\mathbf{a}) =\frac{1}{2m}\sum_{i=1}^{m}(h(x_i)-y_i)^{2}$

By plugging our $h(\mathbf{x})$ to our $J(\mathbf{a})$ above, we get:

$J(\mathbf{a}) =\frac{1}{2m}(\mathbf{Xa}-\mathbf{y})^{2}$

Then we do some linear algebra to simplify that, we get:

$J(\mathbf{a}) =\frac{1}{2m}(\mathbf{Xa}-\mathbf{y})^T(\mathbf{Xa}-\mathbf{y})$

$J(\mathbf{a}) =\frac{1}{2m}((\mathbf{Xa})^T-\mathbf{y}^T)(\mathbf{Xa}-\mathbf{y})$

$J(\mathbf{a})=\frac{1}{2m}((\mathbf{Xa})^T\mathbf{Xa}-(\mathbf{Xa})^T\mathbf{y}-\mathbf{y}^T\mathbf{Xa}+\mathbf{y}^T\mathbf{y})$

Look carefully to the equation above, each term above will produce scalar value, for example :

$\mathbf{y}^T\mathbf{y}=\begin{bmatrix} y_0\,y_1\,y_2\,...\,y_n\end{bmatrix}\begin{bmatrix} y_0\\y_1\\y_2\\...\\y_n\end{bmatrix}=y_0^2+y_1^2+y_2^2+...+y_n^2=scalar\,value$

Thus, we can change third term, $\mathbf{y}^T(\mathbf{Xa})=(\mathbf{Xa})^T\mathbf{y}$  since these will produce same scalar value (multiplication is commutative). Then, by combining second and third term, we can simplify $J(\mathbf{a})$ become:

$J(\mathbf{a})=\frac{1}{2m}((\mathbf{Xa})^T\mathbf{Xa}-2\mathbf{Xa}^T\mathbf{y}+\mathbf{y}^T\mathbf{y})$

Once again, our purpose is to find a polynomial function that minimizes error, $J(\mathbf{a})$. And in our case, we will find parameter $\mathbf{a}$ that minimizes error, $J(\mathbf{a})$. To do so, we already know in high school math or in undergraduate calculus course that we can take first differential toward $\mathbf{a}$, and make it equal to zero. Here we go.

$\frac{dJ(\mathbf{a})}{d\mathbf{a}}=\frac{1}{2m}(2\mathbf{X}^T\mathbf{Xa}-2\mathbf{Xa}^T\mathbf{y})=0$

$(\mathbf{X}^T\mathbf{Xa}-\mathbf{X}^T\mathbf{y})=0$

$\mathbf{X}^T\mathbf{Xa}=\mathbf{X}^T\mathbf{y}$

$\mathbf{a}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Voila! We can get the value of $\mathbf{a}$ for our polynomial function that best fits our data points, assuming that  $(\mathbf{X}^T\mathbf{X})$ is invertible. Thus, re-plugging in to our polynomial model, our hypothesis function become:

$h(\mathbf{x})=a_0 x^0+a_1 x^1+a_2 x^2+....+a_n x^n+1$ with  $\mathbf{a}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

where design matrix $\mathbf{X}=\begin{bmatrix} 1\,\,x_1^1\,\,x_1^2\,...\,x_1^n\\ 1\,\,x_2^1\,\,x_2^2\,...\,x_2^n\\ 1\,\,x_3^1\,\,x_3^2\,...\,x_3^n\\ ...\,\,...\,\,...\,\,...\,\,...\\ 1\,x_m^1\,x_m^2\,...\,x_m^n\end{bmatrix}$

Again, see! The process is really similar with what we already discuss in linear regression here. The difference is only the design matrix.

Discussing order number trade-off  (capability vs overfitting)

Using higher order, we can have more capability to fit more complex data points. See picture below for example.

From picture above, it is clearly see that the green line with 9 order polynomial function better on fitting the data points compare to those with lower order. But, higher order our model, it will be more prone to be overfitting. Overfitting means that our model is really good for data training in the term of giving really small error value, but it will gave us bigger error in prediction of unseen data. See here for the illustration.

Picture above, red line uses higher order than green line, so that it gives smaller error in the training data compared to the green line’s. But,  see when the input $x=0.9$, it will give big error. In this case, green line will be better than red line to model those data points. At this point, we have to make our model not only give small error, but also make our model more general that will generalize when coping with the prediction data (unseen data), for instance input $x=0.9$ in the picture above.

To avoid overfitting problem, there is a technique called regularization. You can read further here.