# Understanding the Kernel Method/Trick in Machine Learning

Up to now in our machine learning and pattern recognition post series, we have covered regression, classification, and clustering. In this post, we will learn another powerful method in machine learning: the kernel method, also called the kernel trick! Why do we use the kernel trick? Two reasons: (1) we don’t need to work out how to form a design matrix. Just imagine, for example, that our features are “words”, not numbers. How can we form a design matrix for them? With the kernel method, we can simply define a kernel, for example one based on the Hamming distance between our “words”. (2) The kernel method gives us a way to project our data into a much higher-dimensional space, even an infinite-dimensional one, and we can take advantage of that so our model performs better.
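To make reason (1) concrete, here is a minimal sketch of a kernel on “words”: we score a pair of equal-length strings by how many character positions agree (i.e., length minus Hamming distance). The function name `hamming_kernel` and the toy word list are our own illustration, not a standard library API.

```python
import numpy as np

# Toy similarity kernel on equal-length "words":
# count of positions where the characters match
# (= word length minus the Hamming distance).
def hamming_kernel(u, v):
    return sum(a == b for a, b in zip(u, v))

words = ["cat", "car", "dog"]
# Gram matrix: pairwise kernel values between all training "words".
# No design matrix of numbers is ever needed.
K = np.array([[hamming_kernel(u, v) for v in words] for u in words])
```

Note that we never had to turn the words themselves into feature vectors; only the pairwise kernel values matter.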

So, how do we apply the kernel trick? We will demonstrate it on regularized regression. For regularized regression using LSE, which we already discussed here, the loss function to be minimized is as follows.

$J(\textbf{a})=\frac{1}{2m}[(\textbf{Xa}-\textbf{y})^T(\textbf{Xa}-\textbf{y})+\lambda \textbf{a}^T\textbf{a}]\\\\ J(\textbf{a})=\frac{1}{2m}[(\textbf{Xa})^T\textbf{Xa}-2(\textbf{Xa})^T\textbf{y}+\textbf{y}^T\textbf{y}+\lambda \textbf{a}^T\textbf{a}]$

where $\textbf{X}$ is the design matrix, $\textbf{y}$ is the vector of training targets $y$, and $\textbf{a}$ is our vector of regression parameters. As we also derived there, the regression parameters are as follows.

$\textbf{a}=(\textbf{X}^T\textbf{X}+\lambda \textbf{I})^{-1}\textbf{X}^T\textbf{y}\\\\ where\,prediction\,\textbf{y}=\textbf{Xa}$
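The closed-form solution above is straightforward to compute numerically. Below is a minimal sketch with made-up random data; `lam` stands for the regularization strength $\lambda$, and we use `np.linalg.solve` rather than forming the inverse explicitly, which is the standard numerically stable choice.

```python
import numpy as np

# Primal ridge regression: a = (X^T X + lambda I)^{-1} X^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # design matrix: 10 samples, 3 features
y = rng.normal(size=10)        # training targets
lam = 0.1                      # regularization strength (lambda)

# Solve (X^T X + lam I) a = X^T y instead of inverting the matrix.
a = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
y_hat = X @ a                  # predictions on the training data
```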

Our regression parameter $\textbf{a}$ above is, in kernel-trick terminology, usually called the primal form. In the kernel trick, we will rewrite it into the so-called “dual form”. To do that, let’s modify our formula for $\textbf{a}$ a little, using the matrix identity $(\textbf{X}^T\textbf{X}+\lambda \textbf{I})^{-1}\textbf{X}^T=\textbf{X}^T(\textbf{X}\textbf{X}^T+\lambda \textbf{I})^{-1}$.

$\textbf{a}=\textbf{X}^T(\textbf{X}\textbf{X}^T+\lambda \textbf{I})^{-1}\textbf{y}\\\\ \textbf{a}=\textbf{X}^T\textbf{b}\,,where \, \textbf{b}=(\textbf{X}\textbf{X}^T+\lambda \textbf{I})^{-1}\textbf{y}$
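This step relies on the matrix identity $(\textbf{X}^T\textbf{X}+\lambda \textbf{I})^{-1}\textbf{X}^T=\textbf{X}^T(\textbf{X}\textbf{X}^T+\lambda \textbf{I})^{-1}$ (sometimes called the push-through identity). A quick numerical sanity check, on made-up data of our own choosing:

```python
import numpy as np

# Check: (X^T X + lam I)^{-1} X^T y  ==  X^T (X X^T + lam I)^{-1} y
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 4))    # 8 samples, 4 features
y = rng.normal(size=8)
lam = 0.5

# Primal form of a (solve in feature space, 4x4 system).
a_primal = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Dual form: b lives in sample space (8x8 system), then a = X^T b.
b = np.linalg.solve(X @ X.T + lam * np.eye(8), y)
a_dual = X.T @ b

assert np.allclose(a_primal, a_dual)
```

Notice the practical trade-off: the primal solve is over a feature-by-feature matrix, the dual solve over a sample-by-sample matrix.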

To find the dual form of the regression parameter, we plug this expression for $\textbf{a}$ into our loss function and then minimize with respect to $\textbf{b}$ instead of $\textbf{a}$. Here we go.

$J(\textbf{a})=\frac{1}{2m}[(\textbf{Xa})^T\textbf{Xa}-2(\textbf{Xa})^T\textbf{y}+\textbf{y}^T\textbf{y}+\lambda \textbf{a}^T\textbf{a}]\\\\ J(\textbf{b})=\frac{1}{2m}[(\textbf{X}\textbf{X}^T\textbf{b})^T\textbf{X}\textbf{X}^T\textbf{b}-2(\textbf{X}\textbf{X}^T\textbf{b})^T\textbf{y}+\textbf{y}^T\textbf{y}+\lambda (\textbf{X}^T\textbf{b})^T\textbf{X}^T\textbf{b}]\\\\ J(\textbf{b})=\frac{1}{2m}[\textbf{b}^T\textbf{XX}^T\textbf{XX}^T\textbf{b}-2\textbf{b}^T\textbf{XX}^T\textbf{y}+\textbf{y}^T\textbf{y}+\lambda\textbf{b}^T\textbf{XX}^T\textbf{b}]$

We now define the “kernel” (Gram) matrix $\textbf{k}=\textbf{XX}^T$, whose entries are the inner products between pairs of training samples. Thus, our derivation becomes:

$J(\textbf{b})=\frac{1}{2m}[\textbf{b}^T\textbf{kkb}-2\textbf{b}^T\textbf{ky}+\textbf{y}^T\textbf{y}+\lambda\textbf{b}^T\textbf{kb}]\\\\ J(\textbf{b})=\frac{1}{2m}[\textbf{b}^T\textbf{kkb}+\lambda\textbf{b}^T\textbf{kb}-2\textbf{b}^T\textbf{ky}+\textbf{y}^T\textbf{y}]\\\\ J(\textbf{b})=\frac{1}{2m}[\textbf{b}^T\textbf{k}(\textbf{k}+\lambda\textbf{I})\textbf{b}-2\textbf{b}^T\textbf{ky}+\textbf{y}^T\textbf{y}]$

To minimize our loss/error function $J(\textbf{b})$, we take its derivative with respect to $\textbf{b}$ and set it equal to zero. Here we go.

$\frac{d[J(\textbf{b})]}{d\textbf{b}}=\frac{d\left\{\frac{1}{2m}[\textbf{b}^T\textbf{k}(\textbf{k}+\lambda\textbf{I})\textbf{b}-2\textbf{b}^T\textbf{ky}+\textbf{y}^T\textbf{y}]\right\}}{d\textbf{b}}=0\\\\ 2\textbf{k}(\textbf{k}+\lambda\textbf{I})\textbf{b}-2\textbf{ky}=0\\\\ \textbf{k}(\textbf{k}+\lambda\textbf{I})\textbf{b}=\textbf{k}\textbf{y}\\\\ \boxed{\textbf{b}=(\textbf{k}+\lambda\textbf{I})^{-1}\textbf{y}}$

where in the last step we cancel $\textbf{k}$ on both sides (assuming it is invertible).

Voilà! We have successfully derived the parameter $\textbf{b}$ that minimizes our regression loss function in dual form. To make the prediction $\textbf{y}$, we can simply plug $\textbf{b}$ back into our prediction formula:

$\textbf{y}=\textbf{Xa}\\\\ \textbf{y}=\textbf{XX}^T\textbf{b},\,\,where\,\,\textbf{a}=\textbf{X}^T\textbf{b}\\\\ \textbf{y}=\textbf{XX}^T(\textbf{k}+\lambda\textbf{I})^{-1}\textbf{y},\,\,where\,\,\textbf{b}=(\textbf{k}+\lambda\textbf{I})^{-1}\textbf{y}\\\\ \boxed{\textbf{y}=\textbf{k}(\textbf{k}+\lambda\textbf{I})^{-1}\textbf{y}},\,\,where\,\,\textbf{k}=\textbf{XX}^T\,\,(note\,\,that\,\,\textbf{k}\,\,is\,\,symmetric)$

See? Now we calculate our regression parameter $\textbf{b}$ and prediction $\textbf{y}$ using only the kernel $\textbf{k}$, and we can define that kernel however we want, as long as it is still a valid kernel. Awesome, right? 😀 Some popular kernels are the Fisher kernel, the polynomial kernel, and the RBF (Radial Basis Function) kernel. The RBF kernel, for example, corresponds to a feature space with an infinite number of dimensions.
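For concreteness, here is a sketch of the RBF (Gaussian) kernel, $k(\textbf{u},\textbf{v})=\exp(-\gamma\|\textbf{u}-\textbf{v}\|^2)$; the bandwidth parameter `gamma` and the toy data are choices of ours for this illustration.

```python
import numpy as np

# RBF (Gaussian) kernel: k(u, v) = exp(-gamma * ||u - v||^2).
# Its implicit feature space is infinite-dimensional, yet the
# kernel value itself is a single cheap-to-compute number.
def rbf_kernel(X1, X2, gamma=1.0):
    # squared Euclidean distance between every row of X1 and X2
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X)  # Gram matrix; diagonal is all ones
```

Plugging this `K` into the boxed dual-form formulas above gives kernel ridge regression with an RBF kernel, with no change to the rest of the derivation.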

Remember, the kernel trick is exactly that: a trick. So we can use it in other machine learning methods too, such as Bayesian regression, SVMs, and so on. In this post, we demonstrated how to derive the kernel formulation, that is, how to bring the primal form to the dual form, for our regularized regression problem. For other machine learning methods, we can derive it with a process similar to the one demonstrated in this post.

The video below shows how to implement the kernel trick in k-means clustering in Python.