# How Gradient Descent Optimization Works for Minimizing Error

We already learn how we can use newton method to minimize error iteratively here. In this post, we will discuss about how we can use gradient descent method to minimize error interatively. When we later talk about neural network (deep neural network will be deep learning), we can say that understanding gradient descent is half of understanding neural network. And in fact, gradient descent is really easy to understand, likewise neural network. It’s true and not exaggerated 😀

Let’s talk about how gradient descent works first. This is also used in other machine learning methods, such as in logistic regression for binary classification here. Let’s assume we have function shown picture below to be minimized.

Our goal is to find the value of $x$ that minimizes $Y$. In the picture above is the pink point. To achieve that point, we will use the gradient information, and we know that gradient actually represents the slope our our function. Then case is, when we have positive gradient, the curve $Y$ will increase when $x$ increases. And when we have negative gradient, the curve $Y$ will decrease when $x$ increases. Thus, how to update our $x$ value so that our $Y$ decreases constantly? The answer is by subtracting our current $x$ with the gradient, specifically weighted gradient. In this case, weighting value will be how big step we want to make in every interation update. Mathematically, gradient descent is written as follow. $x_{n+1}=x_n-\lambda \frac{df(x)}{dx} \\\\ \boxed {x_{n+1}=x_n-\lambda \frac{dy}{dx}}$

Let’s put an example for clearer understanding. In picture above, we assume that we have $Y=f(x)=\frac{1}{2}x^2$. We will demonstrate how we can reach pink point minimizing $Y$ value, in this case the pink point will be $x=0$. Let’s say we are now in $x=10$. Here is how we update our $x$ in every iteration. $\frac{dy}{dx}=\frac{df(x)}{dx}=\frac{d[\frac{1}{2}x^2]}{dx}=2\frac{1}{2}x=x\\ let\,weighting\,value\,\lambda=0.5\\\\ iteration\, 1:\\ x_n=10\\ x_{n+1}=x_n-\lambda \frac{df(x_n)}{dx}=x_n-\lambda x_n=10-0.5(10)=5\\\\ interation\,2:\\ x_n=5\\ x_{n+1}=x_n-\lambda \frac{df(x_n)}{dx}=x_n-\lambda x_n=5-0.5(5)=2.5\\\\ interation\,3:\\ x_n=2.5\\ x_{n+1}=x_n-\lambda \frac{df(x_n)}{dx}=x_n-\lambda x_n=2.5-0.5(2.5)=1.25\\\\ interation\,4:\\ x_n=1.25\\ x_{n+1}=x_n-\lambda \frac{df(x_n)}{dx}=x_n-\lambda x_n=1.25-0.5(1.25)=0.625\\\\ interation\,5:\\ x_n=0.625\\ x_{n+1}=x_n-\lambda \frac{df(x_n)}{dx}=x_n-\lambda x_n=0.625-0.5(0.625)=\boxed{0.3125}$

In only 5 iterations, we already get $x=0.3125$ which is closed to the pink point we already know for this case, which is $x=0$. We can just continue to iterate it, and it will be converged at $x=0$ in our case here. Easy, right?

In machine learning, our $Y=f(x)$ will be our error function/loss function. And by using this gradient descent method, we will try to minimize our error in every iteration update. Again, weighting value $\lambda$ here is how big step we want in every iteration update. Bigger $\lambda$, faster to converge, and vice versa. But, bigger $\lambda$ can be not stable and even cause not being able to converge.

Maybe some of you will ask, how we can know the gradient value of our loss function? In fact we often cannot know the mathematical form of our loss function, not like in the example we know that it is $Y=\frac{1}{2}x^2$. How we can take first differential of it to get gradient? The answer is we will use discrete method when we will implement into a code in computer. In this case, our gradient will be $\frac{dy}{dx}=\frac{y_{new}-y_{old}}{x_{new}-x_{old}}$. “New” here means the value in current iteration, and “old” means the value we get in the previous iteration.

That’s all. Easy but powerful, right?