Limited-memory BFGS
Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning.[1][2]
Like the original BFGS, L-BFGS uses an estimation to the inverse Hessian matrix to steer its search through variable space, but where BFGS stores a dense n×n approximation to the inverse Hessian (n being the number of variables in the problem), L-BFGS stores only a few vectors that represent the approximation implicitly. Due to its resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with a large number of variables. Instead of the inverse Hessian Hk, L-BFGS maintains a history of the past m updates of the position x and gradient ∇f(x), where generally the history size m can be small (often m<10). These updates are used to implicitly do operations requiring the Hk-vector product.
Algorithm
L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplication for finding the search direction is carried out  . There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion."[3][4]
. There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two loop recursion."[3][4]
We'll take as given  , the position at the
, the position at the  -th iteration, and
-th iteration, and  where
 where  is the function being minimized, and all vectors are column vectors. We also assume that we have stored the last
 is the function being minimized, and all vectors are column vectors. We also assume that we have stored the last  updates of the form
 updates of the form  and
 and  . We'll define
. We'll define  , and
, and  will be the 'initial' approximate of the inverse Hessian that our estimate at iteration
 will be the 'initial' approximate of the inverse Hessian that our estimate at iteration  begins with.
Then we can compute the (uphill) direction as follows:
 begins with.
Then we can compute the (uphill) direction as follows:
For




For


Stop with
This formulation is valid whether we are minimizing or maximizing. Note that if we are minimizing, the search direction would be the negative of z (since z is "uphill"), and if we are maximizing,  should be negative definite rather than positive definite. We would typically do a backtracking line search in the search direction (any line search would be valid, but L-BFGS does not require exact line searches in order to converge).
 should be negative definite rather than positive definite. We would typically do a backtracking line search in the search direction (any line search would be valid, but L-BFGS does not require exact line searches in order to converge).
Commonly, the inverse Hessian  is represented as a diagonal matrix, so that initially setting
 is represented as a diagonal matrix, so that initially setting  requires only an element-by-element multiplication.
 requires only an element-by-element multiplication.
This two loop update only works for the inverse Hessian. Approaches to implementing L-BFGS using the direct approximate Hessian  have also been developed, as have other means of approximating the inverse Hessian.[5]
 have also been developed, as have other means of approximating the inverse Hessian.[5]
Applications
L-BFGS has been called "the algorithm of choice" for fitting log-linear (MaxEnt) models and conditional random fields with  -regularization.[2]
-regularization.[2]
Variants
Since BFGS (and hence L-BFGS) is designed to minimize smooth functions without constraints, the L-BFGS algorithm must be modified to handle functions that include non-differentiable components or constraints. A popular class of modifications are called active-set methods, based on the concept of the active set. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.
L-BFGS-B
The L-BFGS-B algorithm extends L-BFGS to handle simple box constraints (aka bound constraints) on variables; that is, constraints of the form li ≤ xi ≤ ui where li and ui are per-variable constant lower and upper bounds, respectively (for each xi, either or both bounds may be omitted).[6][7] The method works by identifying fixed and free variables at every step (using a simple gradient method), and then using the L-BFGS method on the free variables only to get higher accuracy, and then repeating the process.
OWL-QN
Orthant-wise limited-memory quasi-Newton (OWL-QN) is an L-BFGS variant for fitting  -regularized models, exploiting the inherent sparsity of such models.[2]
It minimizes functions of the form
-regularized models, exploiting the inherent sparsity of such models.[2]
It minimizes functions of the form
where  is a differentiable convex loss function. The method is an active-set type method: at each iterate, it estimates the sign of each component of the variable, and restricts the subsequent step to have the same sign. Once the sign is fixed, the non-differentiable
 is a differentiable convex loss function. The method is an active-set type method: at each iterate, it estimates the sign of each component of the variable, and restricts the subsequent step to have the same sign. Once the sign is fixed, the non-differentiable  term becomes a smooth linear term which can be handled by L-BFGS. After a L-BFGS step, the method allows some variables to change sign, and repeats the process.
 term becomes a smooth linear term which can be handled by L-BFGS. After a L-BFGS step, the method allows some variables to change sign, and repeats the process.
O-LBFGS
Schraudolph et al. present an online approximation to both BFGS and L-BFGS.[8] Similar to stochastic gradient descent, this can be used to reduce the computational complexity by evaluating the error function and gradient on a randomly drawn subset of the overall dataset in each iteration. It has been shown that O-LBFGS has a global almost sure convergence [9] while the online approximation of BFGS (O-BFGS) is not necessarily convergent.[10]
Implementations
An early, open source implementation of L-BFGS in Fortran exists in Netlib as a shar archive . Multiple other open source implementations have been produced as translations of this Fortran code (e.g. java, and python via SciPy). Other implementations exist:
- fmincon (Matlab optimization toolbox)
- FMINLBFGS (for Matlab, BSD license)
- minFunc (also for Matlab)
- LBFGS-D (in the D programming language))
- Frequently as part of generic optimization libraries (e.g. Mathematica, FuncLib C# library, and dlib C++ library)
- The libLBFGS is a C implementation.
- Maximization in Two-Class Logistic Regression (in Microsoft Azure ML)
Implementations of variants
The L-BFGS-B variant also exists as ACM TOMS algorithm 778.[7] In February 2011, some of the authors of the original L-BFGS-B code posted a major update (version 3.0).
A reference implementation[11] is available in Fortran 77 (and with a Fortran 90 interface) at the author's website. This version, as well as older versions, has been converted to many other languages, including a Java wrapper for v3.0; Matlab interfaces for v3.0, v2.4, and v2.1; a C++ interface for v2.1;
a Python interface for v3.0 as part of scipy.optimize.minimize; an OCaml interface for v2.1 and v3.0; version 2.3 has been converted to C by f2c and is available at this website; and R's optim general-purpose optimizer routine includes L-BFGS-B by using method="L-BFGS-B".[12]
There exists a complete C++11 rewrite of the L-BFGS-B solver using Eigen3.
OWL-QN implementations are available in:
- C++ implementation by its designers, includes the original ICML paper on the algorithm[2]
- The CRF toolkit Wapiti includes a C implementation
- libLBFGS
Works cited
- ↑ Malouf, Robert (2002). A comparison of algorithms for maximum entropy parameter estimation (PDF). Proc. Sixth Conf. on Natural Language Learning (CoNLL). pp. 49–55.
- 1 2 3 4 Andrew, Galen; Gao, Jianfeng (2007). "Scalable training of L₁-regularized log-linear models". Proceedings of the 24th International Conference on Machine Learning. doi:10.1145/1273496.1273501. ISBN 9781595937933.
- ↑ Matthies, H.; Strang, G. (1979). "The solution of non linear finite element equations". International Journal for Numerical Methods in Engineering 14 (11): 1613–1626. doi:10.1002/nme.1620141104.
- ↑ Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage". Mathematics of Computation 35 (151): 773–782. doi:10.1090/S0025-5718-1980-0572855-7.
- ↑ Byrd, R. H.; Nocedal, J.; Schnabel, R. B. (1994). "Representations of Quasi-Newton Matrices and their use in Limited Memory Methods". Mathematical Programming 63 (4): 129–156. doi:10.1007/BF01582063.
- ↑ Byrd, R. H.; Lu, P.; Nocedal, J.; Zhu, C. (1995). "A Limited Memory Algorithm for Bound Constrained Optimization". SIAM J. Sci. Comput. 16 (5): 1190–1208. doi:10.1137/0916069.
- 1 2 Zhu, C.; Byrd, Richard H.; Lu, Peihuang; Nocedal, Jorge (1997). "L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization". ACM Transactions on Mathematical Software 23 (4): 550–560. doi:10.1145/279232.279236.
- ↑ N. Schraudolph, J. Yu, and S. Günter (2007). A stochastic quasi-Newton method for online convex optimization. AISTATS.
- ↑ A. Mokhtari and A. Ribeiro (2015). "Global convergence of online limited memory BFGS" (PDF). Journal of Machine Learning Research 16: 3151–3181.
- ↑ A. Mokhtari and A. Ribeiro (2014). "RES: Regularized Stochastic BFGS Algorithm". IEEE Transactions on Signal Processing 62 (23): 6089–6104. doi:10.1109/TSP.2014.2357775.
- ↑ Morales, J. L.; Nocedal, J. (2011). "Remark on "algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization"". ACM Transactions on Mathematical Software 38: 1. doi:10.1145/2049662.2049669.
- ↑ "General-purpose Optimization". R documentation. Comprehensive R Archive Network.
Further reading
- Liu, D. C.; Nocedal, J. (1989). "On the Limited Memory Method for Large Scale Optimization". Mathematical Programming B 45 (3): 503–528. doi:10.1007/BF01589116.
- Byrd, Richard H.; Lu, Peihuang; Nocedal, Jorge; Zhu, Ciyou (1995). "A Limited Memory Algorithm for Bound Constrained Optimization". SIAM Journal on Scientific and Statistical Computing 16 (5): 1190–1208. doi:10.1137/0916069.
- Haghighi, Aria (2 Dec 2014). "Numerical Optimization: Understanding L-BFGS".
| 
 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


