Newton’s Method Optimization: Derivation and How It Works

In machine learning, we usually define a cost function that measures the error between our prediction and the ground-truth data. We then build a good hypothesis function, either for a classifier or a regressor, by minimizing that cost function. We have already discussed in previous posts how we can use LSE (Least Squares Estimation) to minimize the cost function for regression problems. These are optimization problems: finding the optimal parameters that give the minimal value of the cost function. Now, we will learn another powerful and important optimization method: Newton's method. I will divide this discussion into three parts: (1) Newton's method, (2) Taylor series, and (3) Newton's method for optimization.

(1) Newton's method – finding the root of an equation

Newton's method was originally intended to find the root of an equation. For example, take the equation x^2+x-2=0. It is easy to find its roots by hand: factoring gives (x+2)(x-1)=0, so the roots are x=-2 and x=1. But what about a more complex equation that must be solved by computer? We can use a numerical method to find the root, and one such method is Newton's method. See the picture below.

Given f(x) as in the picture above, we can find the root location using Newton's method. Suppose we start at x_0. First, we find the tangent line of f(x) at x_0 (the red dashed line in the picture). This first tangent line intersects the x-axis at x_1. Then, we find the tangent line of f(x) at x_1, which intersects the x-axis at x_2. By iterating this process, we approach the root location. That is the idea behind Newton's method for root finding.

Now, let's derive the mathematical formula of Newton's method. The tangent line of f(x) at x_0 is the straight line that touches f(x) at x_0 and has the same slope as f(x) there. From high school mathematics, such a line can be written as:

y=mx+c , where the gradient m equals the gradient of f(x) at x_0, i.e., m=f'(x_0)

To find the value of c, we can plug a known point of f(x) into y=mx+c. In this case, x=x_0 and y=f(x_0). Here we go:

f(x_0)=f'(x_0)x_0+c \Rightarrow c=f(x_0)-f'(x_0)x_0

After getting c, our tangent line becomes:

y=mx+c
y=f'(x_0)x+f(x_0)-f'(x_0)x_0
y=f(x_0)+f'(x_0)[x-x_0]

We know that our tangent line intersects the x-axis at (x_1, 0). Thus, by plugging this coordinate into our tangent line equation, we get:

y=f(x_0)+f'(x_0)[x-x_0]
0=f(x_0)+f'(x_0)[x_1-x_0]
x_1-x_0=-\frac{f(x_0)}{f'(x_0)}
x_1=x_0-\frac{f(x_0)}{f'(x_0)}

Thus, Newton's method consists of iterating the equation \boxed{x_{n+1}=x_n-\frac{f(x_n)}{f'(x_n)}}.
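To make the iteration concrete, here is a minimal sketch (not from the original post) that applies the update rule above to the article's example f(x) = x^2 + x - 2, whose roots are -2 and 1. The function name `newton_root` and the stopping tolerance are my own choices for illustration.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Iterate x_{n+1} = x_n - f(x_n)/f'(x_n) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# The example from the text: f(x) = x^2 + x - 2, roots at x = 1 and x = -2.
f = lambda x: x**2 + x - 2
f_prime = lambda x: 2*x + 1

print(newton_root(f, f_prime, x0=3.0))   # converges to the root 1
print(newton_root(f, f_prime, x0=-5.0))  # converges to the root -2
```

Note that which root the iteration finds depends on the starting point x_0, since each tangent step simply slides toward the nearest basin.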


(2) Taylor series – approximating a function

A Taylor series is an approximation of a function using a series expansion. See the picture below.


The picture above illustrates the approximation of the red curve by a Taylor series of some order n (blue curve) around x=0. The higher the order, the more precise the approximation. This means we can approximate any sufficiently smooth function using a Taylor series. The Taylor series of a function f(x) around x=a can be written as:

f(x)\approx f(a)+\frac{f'(a)}{1!}(x-a)+\frac{f''(a)}{2!}(x-a)^2+\frac{f'''(a)}{3!}(x-a)^3+...
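As a quick sketch of the formula above (my own example, not from the post), we can expand e^x around a=0. Since every derivative of e^x is e^x, each coefficient is simply e^a/k!, and the error shrinks as the order grows, just as the figure suggests.

```python
import math

def taylor_exp(x, order, a=0.0):
    """Sum the first `order`+1 terms of the Taylor series of e^x around a.
    Every derivative of e^x is e^x, so the k-th coefficient is e^a / k!."""
    return sum(math.exp(a) * (x - a)**k / math.factorial(k)
               for k in range(order + 1))

x = 1.0  # approximate e^1 = 2.71828...
for n in (1, 2, 4, 8):
    approx = taylor_exp(x, n)
    print(f"order {n}: {approx:.6f} (error {abs(approx - math.exp(x)):.2e})")
```

Running this shows the error dropping rapidly with each added order, which is exactly the "higher order, more precise" behavior described above.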


(3) Newton’s method for optimization

Now that we know how Newton's method finds a root, and how a Taylor series approximates a function, we can extend Newton's method to optimization: finding the minimum of a function. For how it works, see the picture below.

The blue line is our cost function, and we want to minimize it until we reach the optimal point. The procedure is very similar to Newton's method for root finding discussed above. Newton's method for root finding is a first-order method: the tangent line can be considered a first-order approximation of f(x). Newton's method for optimization, in contrast, is a second-order method: we approximate f(x) with a second-order Taylor series. This approximation is shown by the orange and red dashed lines in the picture above.

Suppose we start at x_0. First, we find the second-order approximation of f(x) at x_0, then find the location of its minimum; in the picture above, this is x_1. This is done by taking the first derivative and setting it equal to zero. Then, we take the next approximation of f(x) at x_1 and find the location of its minimum. We iterate this process until we reach the optimal point.

Now, it's time to derive the mathematical formula of Newton's method for optimization. 🙂 The second-order approximation of f(x) around x=a is as follows.

f(x)\approx f(a)+\frac{f'(a)}{1!}(x-a)+\frac{f''(a)}{2!}(x-a)^2
f(x)\approx f(a)+f'(a)x-f'(a)a+\frac{1}{2}f''(a)x^2-f''(a)ax+\frac{1}{2}f''(a)a^2
f(x)\approx \frac{1}{2}f''(a)x^2+[f'(a)-f''(a)a]x+[f(a)-f'(a)a+\frac{1}{2}f''(a)a^2]

Then, to find the location of the minimum of the approximation above, we take the first derivative and set it equal to zero. Here we go.

\frac{df(x)}{dx}\approx f''(a)x+[f'(a)-f''(a)a]=0
f''(a)x=f''(a)a-f'(a)
x = \frac{f''(a)a}{f''(a)}-\frac{f'(a)}{f''(a)}
x = a-\frac{f'(a)}{f''(a)}

Voila! We have just derived Newton's method for optimization. To find the minimum of a cost function in machine learning, we iterate the equation \boxed{x_{n+1} = x_n-\frac{f'(x_n)}{f''(x_n)}}.
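The derived update rule can be sketched in a few lines. This is an illustrative example, not code from the post: the quartic cost below is a hypothetical convex function chosen so that f''(x) > 0 everywhere, which keeps the iteration stable (as the comments below point out, negative or near-zero curvature can push the iterate the wrong way).

```python
def newton_minimize(f_prime, f_second, x0, tol=1e-10, max_iter=100):
    """Iterate x_{n+1} = x_n - f'(x_n)/f''(x_n) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_second(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical convex cost: f(x) = (x - 3)^4 + (x - 3)^2, minimum at x = 3.
f_prime  = lambda x: 4*(x - 3)**3 + 2*(x - 3)
f_second = lambda x: 12*(x - 3)**2 + 2

print(newton_minimize(f_prime, f_second, x0=0.0))  # converges to 3.0
```

Comparing with the root-finding version earlier, the only change is that f and f' are replaced by f' and f'': minimizing f is the same as finding a root of f'.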



4 thoughts on “Newton’s Method Optimization: Derivation and How It Works”

  1. Unfortunately the graphic for Newton’s method is really misleading. The shown parabolas are angled, which is wrong. A parabola cannot be angled; it is always symmetric around its minimum. The only thing that can change is its curvature, controlled by the leading coefficient.

  2. Hi, Ardianumam, thanks for the illustration of this interesting extension of Newton’s method. I agree with ProVen that you should revise the graphical illustration with the parabolas. It is not correct: it shows the parabolas that would result if the solution were very close to the optimum, where the second derivative is ‘almost’ like the second derivative at the optimum. As you defined the algorithm, however, you are taking f”(x_n), and when n=0 the second derivative might not look at all like the one you need to get close to the optimum. As a matter of fact, the function you picked as an example has negative curvature at x_0, which would make x_1 end up further away from the optimum than x_0, since f'(x_0)>0. This illustrates the possible instability of all Newton’s methods (note that f”(x_n) may become close to zero). If, however, you had some good guess about what the second derivative is like at the optimum, and you used that in every step, then your sketch would be correct (up to the directional adjustment pointed out by ProVen), and the convergence might work if there is no local minimum. It should also work as you define it, as long as the function is convex (since only the sign of the first derivative then decides the sign of the correction in each step).
