Newton's method in optimization

A comparison of gradient descent (green) and Newton's method (red) for minimizing a function (with small step sizes). Newton's method uses curvature information to take a more direct route.

In calculus, Newton's method is an iterative method for finding the roots of a differentiable function $f$ (i.e. solutions to the equation $f(x)=0$). In optimization, Newton's method is applied to the derivative $f'$ of a twice-differentiable function $f$ to find the roots of the derivative (solutions to $f'(x)=0$), also known as the stationary points of $f$.

Method

In the one-dimensional problem, Newton's method attempts to construct a sequence $x_n$ from an initial guess $x_0$ that converges towards some value $x^*$ satisfying $f'(x^*)=0$. This $x^*$ is a stationary point of $f$.

The second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$ is:

f_T(x) = f_T(x_n+\Delta x) = f(x_n)+f'(x_n)\Delta x+\frac{1}{2} f''(x_n) \Delta x^2 \approx f(x_n+\Delta x)

We want to find $\Delta x$ such that $x_n + \Delta x$ is a stationary point of this quadratic approximation (a maximum or a minimum, depending on the sign of $f''(x_n)$). To do so, we set the derivative of the expansion with respect to $\Delta x$ equal to zero:

0 = \frac{d}{d\Delta x} \left(f(x_n)+f'(x_n)\Delta x+\frac{1}{2} f''(x_n) \Delta x^2\right) = f'(x_n)+f''(x_n) \Delta x

Solving this equation gives $\Delta x = -f'(x_n)/f''(x_n)$ (assuming $f''(x_n) \neq 0$), and it can be hoped that $x_{n+1} = x_n + \Delta x = x_n - f'(x_n)/f''(x_n)$ will be closer to a stationary point $x^*$. Provided that $f$ is a twice-differentiable function and other technical conditions are satisfied, the sequence $x_1, x_2, \dots$ will converge to a point $x^*$ satisfying $f'(x^*)=0$.
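
As an illustration, the iteration can be written in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `newton_1d`, its tolerance and iteration limit, and the test function $f(x) = x^4 - 3x^2 + 2$ are all assumptions chosen for the example.

```python
def newton_1d(df, d2f, x0, tol=1e-10, max_iter=50):
    """Iterate x_{n+1} = x_n - f'(x_n) / f''(x_n) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        curvature = d2f(x)
        if curvature == 0.0:
            raise ZeroDivisionError("f''(x_n) is zero; the Newton step is undefined")
        step = -df(x) / curvature
        x += step
        if abs(step) < tol:
            break
    return x


# Illustrative function (an assumption): f(x) = x**4 - 3*x**2 + 2
def df(x):
    return 4 * x**3 - 6 * x      # f'(x)


def d2f(x):
    return 12 * x**2 - 6         # f''(x)


x_min = newton_1d(df, d2f, x0=2.0)   # approaches the minimizer sqrt(3/2)
x_max = newton_1d(df, d2f, x0=0.1)   # approaches the local maximum at 0
```

Started from $x_0 = 2$ the iteration approaches the minimizer $\sqrt{3/2}$, while from $x_0 = 0.1$ it approaches the local maximum at $0$. This illustrates that the method seeks stationary points in general, not minima specifically.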

Geometric interpretation

The geometric interpretation of Newton's method is that at each iteration one approximates $f(x)$ by a quadratic function around $x_n$, and then takes a step towards the maximum or minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if $f(x)$ happens to be a quadratic function, then the exact extremum is found in one step.
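
For instance, take the general quadratic $f(x) = a x^2 + b x + c$ with $a \neq 0$ (a worked example added for illustration). Then $f'(x_0) = 2 a x_0 + b$ and $f''(x_0) = 2a$, so a single Newton step from any starting point $x_0$ gives

x_1 = x_0 - \frac{2 a x_0 + b}{2a} = -\frac{b}{2a},

which is exactly the stationary point of $f$.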

Higher dimensions

The above iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, $\nabla f(\mathbf{x})$, and the reciprocal of the second derivative with the inverse of the Hessian matrix, $\mathbf{H}f(\mathbf{x})$. One obtains the iterative scheme

\mathbf{x}_{n+1} = \mathbf{x}_n - [\mathbf{H}f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n), \ n \ge 0.

Often Newton's method is modified to include a small step size $\gamma \in (0,1)$ instead of $\gamma = 1$:

\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma[\mathbf{H} f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n).

This is often done to ensure that the Wolfe conditions are satisfied at each step $\mathbf{x}_n \to \mathbf{x}_{n+1}$ of the iteration.
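
The following Python sketch shows one simple way to choose $\gamma$, under stated assumptions. It uses backtracking and checks only the sufficient-decrease (Armijo) condition, which is the first of the two Wolfe conditions; the curvature condition is not checked. The function name `damped_newton`, the constants `c1` and `shrink`, the steepest-descent fallback, and the test function $f(\mathbf{x}) = \sum_i \sqrt{1 + x_i^2}$ are all illustrative choices.

```python
import numpy as np


def damped_newton(f, grad, hess, x0, c1=1e-4, shrink=0.5, tol=1e-10, max_iter=100):
    """Newton iteration with gamma chosen by backtracking from gamma = 1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)   # Newton direction: H d = -g
        slope = g @ d                      # directional derivative along d
        if slope >= 0:
            # d is not a descent direction; fall back to steepest descent
            # (an assumption of this sketch).
            d, slope = -g, -(g @ g)
        gamma = 1.0
        # Shrink gamma until the sufficient-decrease (Armijo) condition holds.
        while f(x + gamma * d) > f(x) + c1 * gamma * slope and gamma > 1e-12:
            gamma *= shrink
        x = x + gamma * d
    return x


# Illustrative convex function (an assumption): f(x) = sum_i sqrt(1 + x_i^2)
def f(x):
    return np.sum(np.sqrt(1.0 + x**2))


def grad(x):
    return x / np.sqrt(1.0 + x**2)


def hess(x):
    return np.diag((1.0 + x**2) ** -1.5)


x_min = damped_newton(f, grad, hess, x0=[2.0, -3.0])
```

For this particular function the full step $\gamma = 1$ maps each component $x_i$ to $-x_i^3$, which diverges whenever $|x_i| > 1$. The backtracking step size restores convergence to the minimum at the origin.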

Where applicable, Newton's method converges much faster towards a local maximum or minimum than gradient descent. In fact, every local minimum has a neighborhood $N$ such that, if we start with $\mathbf{x}_0 \in N$, Newton's method with step size $\gamma = 1$ converges quadratically (if the Hessian is invertible and a Lipschitz continuous function of $\mathbf{x}$ in that neighborhood).
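
The quadratic rate can be seen explicitly in a simple one-dimensional case (an example chosen here for illustration). For $f(x) = x - \ln x$ on $x > 0$, which has its minimum at $x^* = 1$, we have $f'(x) = 1 - 1/x$ and $f''(x) = 1/x^2$, so the Newton update and the resulting error are

x_{n+1} = x_n - \frac{1 - 1/x_n}{1/x_n^2} = 2x_n - x_n^2, \qquad 1 - x_{n+1} = (1 - x_n)^2.

The error is squared at every iteration: starting from $x_0 = 0.5$, the successive errors are $0.5$, $0.25$, $0.0625$, $0.0039\ldots$, and so on.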

Finding the inverse of the Hessian in high dimensions can be an expensive operation. In such cases, instead of directly inverting the Hessian it is better to calculate the vector $\Delta \mathbf{x} = \mathbf{x}_{n+1} - \mathbf{x}_n$ as the solution to the system of linear equations

[\mathbf{H} f(\mathbf{x}_n)] \mathbf{\Delta x} = -\nabla f(\mathbf{x}_n)

which may be solved by various factorizations or approximately (but to great accuracy) using iterative methods. Many of these methods are only applicable to certain types of equations; for example, the Cholesky factorization and the conjugate gradient method will only work if $\mathbf{H}f(\mathbf{x}_n)$ is a positive definite matrix. While this may seem like a limitation, it is often a useful indicator of something gone wrong: if a minimization problem is being approached and $\mathbf{H}f(\mathbf{x}_n)$ is not positive definite, then the iterations may be converging to a saddle point and not a minimum.
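
A minimal Python sketch of this idea follows. It computes the step from a Cholesky factorization and reports whether the factorization succeeded, which serves as the positive-definiteness check described above. The function name `newton_step` and the example matrices are assumptions. For brevity the two triangular systems are passed to NumPy's general solver, which does not exploit their triangular structure.

```python
import numpy as np


def newton_step(hessian, gradient):
    """Solve H dx = -g; also report whether H was positive definite."""
    try:
        # H = L @ L.T; NumPy raises LinAlgError if H is not positive definite.
        L = np.linalg.cholesky(hessian)
    except np.linalg.LinAlgError:
        # Fall back to a general (LU-based) solve for an indefinite Hessian.
        return np.linalg.solve(hessian, -gradient), False
    y = np.linalg.solve(L, -gradient)    # forward substitution: L y = -g
    dx = np.linalg.solve(L.T, y)         # back substitution: L^T dx = y
    return dx, True


g = np.array([1.0, 2.0])

H_pd = np.array([[4.0, 1.0], [1.0, 3.0]])      # positive definite
dx_pd, ok_pd = newton_step(H_pd, g)

H_indef = np.array([[1.0, 0.0], [0.0, -2.0]])  # indefinite (saddle-like)
dx_indef, ok_indef = newton_step(H_indef, g)
```

In a minimization routine, a `False` flag would typically trigger one of the Hessian modifications discussed later in this section rather than accepting the step as is.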

On the other hand, if a constrained optimization is done (for example, with Lagrange multipliers), the problem may become one of saddle point finding, in which case the Hessian will be symmetric indefinite and $\mathbf{x}_{n+1}$ will need to be computed with a method suited to such matrices, such as the $LDL^T$ variant of Cholesky factorization or the conjugate residual method.

There also exist various quasi-Newton methods, where an approximation for the Hessian (or its inverse directly) is built up from changes in the gradient.
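
As one concrete illustration, the BFGS method maintains an estimate of the inverse Hessian and updates it from the change in position $\mathbf{s}$ and the change in gradient $\mathbf{y}$ between two iterates. The sketch below shows only that update; the function name `bfgs_inverse_update` and the quadratic test problem are assumptions, and BFGS is picked here merely as a well-known example of the family.

```python
import numpy as np


def bfgs_inverse_update(H_inv, s, y):
    """BFGS update of an inverse-Hessian estimate.

    s = x_{n+1} - x_n is the change in position and
    y = grad f(x_{n+1}) - grad f(x_n) is the change in gradient.
    """
    rho = 1.0 / (y @ s)                  # requires y @ s > 0
    identity = np.eye(len(s))
    V = identity - rho * np.outer(s, y)
    return V @ H_inv @ V.T + rho * np.outer(s, s)


# Illustrative quadratic (an assumption): f(x) = 0.5 * x^T A x, so grad f(x) = A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x_old = np.array([1.0, 1.0])
x_new = np.array([0.5, 0.2])
s = x_new - x_old
y = A @ x_new - A @ x_old

H_inv = bfgs_inverse_update(np.eye(2), s, y)
```

The updated estimate satisfies the secant equation `H_inv @ y == s` (up to rounding), which is the sense in which it is built from changes in the gradient.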

If the Hessian is close to a non-invertible matrix, the inverted Hessian can be numerically unstable and the iteration may diverge. Several workarounds exist, with varying success depending on the problem. One can, for example, modify the Hessian by adding a correction matrix $B_n$ so as to make $\mathbf{H}f(\mathbf{x}_n) + B_n$ positive definite. One approach is to diagonalize $\mathbf{H}f(\mathbf{x}_n)$ and choose $B_n$ so that $\mathbf{H}f(\mathbf{x}_n) + B_n$ has the same eigenvectors as $\mathbf{H}f(\mathbf{x}_n)$, but with each negative eigenvalue replaced by $\epsilon > 0$.
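
The eigenvalue replacement can be sketched in Python as follows. The function name `make_positive_definite` and the example matrix are assumptions. As a small departure from the description above, this sketch raises every eigenvalue below $\epsilon$ (including zero and tiny positive ones) up to $\epsilon$, so that the result is always positive definite.

```python
import numpy as np


def make_positive_definite(hessian, eps=1e-6):
    """Return H + B: same eigenvectors as H, eigenvalues below eps raised to eps."""
    eigvals, eigvecs = np.linalg.eigh(hessian)   # symmetric eigendecomposition
    clipped = np.maximum(eigvals, eps)
    return eigvecs @ np.diag(clipped) @ eigvecs.T


# Example symmetric indefinite matrix (an assumption): eigenvalues 3 and -1.
H = np.array([[1.0, 2.0], [2.0, 1.0]])
H_mod = make_positive_definite(H, eps=0.1)
B = H_mod - H                                    # the implied correction matrix B_n
```

Here the eigenvalue $-1$ is replaced by $0.1$ while the eigenvalue $3$ and both eigenvectors are left unchanged.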

An approach exploited in the Levenberg–Marquardt algorithm (which uses an approximate Hessian) is to add a scaled identity matrix $\mu I$ to the Hessian, with the scale adjusted at every iteration as needed. For large $\mu$ and small Hessian, the iterations will behave like gradient descent with step size $1/\mu$. This results in slower but more reliable convergence where the Hessian does not provide useful information.
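
The sketch below illustrates the damping idea with the exact Hessian rather than the approximate one used by Levenberg–Marquardt itself. The function names, the rule of multiplying or dividing $\mu$ by 10 depending on whether a trial step reduces $f$, and the test function $f(x) = \sqrt{1 + x^2}$ are all assumptions; the factor-of-10 rule is a common heuristic, not something prescribed by the text above.

```python
import numpy as np


def damped_step(hessian, gradient, mu):
    """Solve (H + mu * I) dx = -g for the step dx."""
    n = hessian.shape[0]
    return np.linalg.solve(hessian + mu * np.eye(n), -gradient)


def minimize_damped(f, grad, hess, x0, mu=1.0, tol=1e-10, max_iter=200):
    """Damped Newton iteration with mu adjusted after every trial step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        trial = x + damped_step(hess(x), g, mu)
        if f(trial) < f(x):
            x, mu = trial, mu / 10.0   # step helped: behave more like Newton
        else:
            mu *= 10.0                 # step failed: behave more like gradient descent
    return x


# Limiting behaviour on an arbitrary example (an assumption).
H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
newton_like = damped_step(H, g, mu=0.0)      # the pure Newton step
gradient_like = damped_step(H, g, mu=1e6)    # approximately -g / mu


# Illustrative convex function (an assumption): f(x) = sqrt(1 + x^2), minimum at 0.
def f(x):
    return np.sum(np.sqrt(1.0 + x**2))


def grad(x):
    return x / np.sqrt(1.0 + x**2)


def hess(x):
    return np.diag((1.0 + x**2) ** -1.5)


x_min = minimize_damped(f, grad, hess, x0=[2.0])
```

With $\mu = 0$ the step is the ordinary Newton step, and with very large $\mu$ it approaches $-\nabla f / \mu$, matching the gradient-descent limit described above.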

This article is issued from Wikipedia - version of the Monday, February 01, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.