Coordinate descent

Coordinate descent is a non-derivative optimization algorithm. To find a local minimum of a function, one does line search along one coordinate direction at the current point in each iteration. One uses different coordinate directions cyclically throughout the procedure.

Description

Coordinate descent is based on the idea that the minimization of a multivariable function $F(\mathbf{x})$ can be achieved by minimizing it along one direction at a time, i.e., solving univariate (or at least much simpler) optimization problems in a loop.^[1] In the simplest case of cyclic coordinate descent, one cyclically iterates through the directions, one at a time, minimizing the objective function with respect to each coordinate direction at a time. That is, in each iteration, for each variable $k$ of the problem in turn, the algorithm solves the optimization problem

x^{k+1}_i = \underset{y\in\mathbb R}{\operatorname{arg\,min}}\; f(x^{k+1}_1, \dots, x^{k+1}_{i-1}, y, x^k_{i+1}, \dots, x^k_n).

Thus, one begins with an initial guess $\mathbf{x}^0$ for a local minimum of $F$ , and get a sequence $\mathbf{x}^0, \mathbf{x}^1, \mathbf{x}^2, \dots$ iteratively.

By doing line search in each iteration, one automatically has

F(\mathbf{x}^0)\ge F(\mathbf{x}^1)\ge F(\mathbf{x}^2)\ge \dots.

It can be shown that this sequence has similar convergence properties as steepest descent. No improvement after one cycle of line search along coordinate directions implies a stationary point is reached.

This process is illustrated below.

Differentiable case

In the case of a continuously differentiable function $F$ , a coordinate descent algorithm can be sketched as:^[1]

Choose an initial parameter vector $x$ .
Until convergence is reached, or for some fixed number of iterations:
- Choose an index $i$ from $1$ to $n$ .
- Choose a step size $α$ .
- Update $x i$ to $x i - α \partial F / \partial x i (x)$ .

The step size can be chosen in various ways, e.g., by solving for the exact minimizer of $f (x i) = F (x)$ (i.e., $F$ with all variables but $x i$ fixed), or by traditional line search criteria.^[1]

Limitations

Coordinate descent has problems with non-smooth functions. The following picture shows that coordinate descent iteration may get stuck at a non-stationary point if the level curves of a function are not smooth. Suppose that the algorithm is at the point $(-2, -2)$ ; then there are two axis-aligned directions it can consider for taking a step, indicated by the red arrows. However, every step along these two directions will increase the objective function's value (assuming a minimization problem), so the algorithm will not take any step, even though both steps together would bring the algorithm closer to the optimum.

Applications

Coordinate descent algorithms are popular with practitioners owing to their simplicity, but the same property has led optimization researchers to largely ignore them in favor of more interesting (complicated) methods.^[1] This situation has changed with the advent of large-scale problems in machine learning, where coordinate descent has been shown competitive to other methods when applied to such problems as training linear support vector machines^[2] (see LIBLINEAR) and non-negative matrix factorization.^[3] They are attractive for problems where computing gradients is infeasible, perhaps because the data required to do so are distributed across computer networks.^[4]

References

1 2 3 4 Wright, Stephen J. (2015). "Coordinate descent algorithms". Mathematical Programming 151 (1): 3–34. arXiv:1502.04759. doi:10.1007/s10107-015-0892-3.
↑ Hsieh, C. J.; Chang, K. W.; Lin, C. J.; Keerthi, S. S.; Sundararajan, S. (2008). "A dual coordinate descent method for large-scale linear SVM". Proceedings of the 25th international conference on Machine learning - ICML '08 (PDF). p. 408. doi:10.1145/1390156.1390208. ISBN 9781605582054.
↑ Hsieh, C. J.; Dhillon, I. S. (2011). Fast coordinate descent methods with variable selection for non-negative matrix factorization (PDF). Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '11. p. 1064. doi:10.1145/2020408.2020577. ISBN 9781450308137.
↑ Nesterov, Yurii (2012). "Efficiency of coordinate descent methods on huge-scale optimization problems" (PDF). SIAM J. Optimization 22 (2): 341–362. doi:10.1137/100802001.

Bezdek, J. C.; Hathaway, R. J.; Howard, R. E.; Wilson, C. A.; Windham, M. P. (1987), "Local convergence analysis of a grouped variable version of coordinate descent", Journal of Optimization theory and applications (Kluwer Academic/Plenum Publishers) 54 (3), pp. 471–477, doi:10.1007/BF00940196
Bertsekas, Dimitri P. (1999). Nonlinear Programming, Second Edition Athena Scientific, Belmont, Massachusetts. ISBN 1-886529-00-0.
Canutescu, AA; Dunbrack, RL (2003), "Cyclic coordinate descent: A robotics algorithm for protein loop closure.", Protein science 12 (5), pp. 963–72, doi:10.1110/ps.0242703, PMID 12717019 .
Luo, Zhiquan; Tseng, P. (1992), "On the convergence of the coordinate descent method for convex differentiable minimization", Journal of Optimization theory and applications (Kluwer Academic/Plenum Publishers) 72 (1), pp. 7–35, doi:10.1007/BF00939948 .
Wu, TongTong; Lange, Kenneth (2008), "Coordinate descent algorithms for Lasso penalized regression", The Annals of Applied Statistics (Institute of Mathematical Statistics) 2 (1), pp. 224–244, doi:10.1214/07-AOAS147 .
Richtarik, Peter; Takac, Martin (April 2011), "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function", Mathematical Programming (Springer), doi:10.1007/s10107-012-0614-z .
Richtarik, Peter; Takac, Martin (December 2012), "Parallel coordinate descent methods for big data optimization", arXiv:1212.0873 .

Optimization: Algorithms, methods, and heuristics

Unconstrained nonlinear: Methods calling …

… functions

… and gradients

Convergence	Trust region Wolfe conditions

Quasi–Newton	BFGS and L-BFGS DFP Symmetric rank-one (SR1)

Other methods	Gauss–Newton Gradient Levenberg–Marquardt Conjugate gradient Truncated Newton

… and Hessians

Newton's method

The graph of a strictly concave quadratic function is shown in blue, with its unique maximum shown as a red dot. Below the graph appears the contours of the function: The level sets are nested ellipses.

Constrained nonlinear

General	Barrier methods Penalty methods

Differentiable	Augmented Lagrangian methods Sequential quadratic programming Successive linear programming

Convex optimization

Convex
minimization

Linear and
quadratic

Interior point	Affine scaling Ellipsoid algorithm of Khachiyan Projective algorithm of Karmarkar

Basis-Exchange	Simplex algorithm of Dantzig Revised simplex algorithm Criss-cross algorithm Principal pivoting algorithm of Lemke

Combinatorial

Paradigms

Graph
algorithms

Minimum spanning tree	Bellman–Ford Borůvka Dijkstra Floyd–Warshall Johnson Kruskal

Network flows

Metaheuristics

Evolutionary algorithm Hill climbing Local search Simulated annealing Tabu search

Categories
- Algorithms and methods
- Heuristics
Software

This article is issued from Wikipedia - version of the Sunday, January 10, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.