L-BFGS algorithm for multivariate optimization
The Newton method is a classical technique, usually the first one studied in an optimization methods course. In its pure form it is rarely used today, but it serves as the basis for a whole family of quasi-Newton methods, one of which is the L-BFGS algorithm.
Quasi-Newton methods
The classical Newton method uses the Hessian of a function. The step is
defined as the product of the inverse Hessian matrix and the function
gradient. If the function is a positive definite quadratic form, one
step is enough to reach its minimum. For an indefinite quadratic form
(which has no minimum), the step leads to a maximum or a saddle point;
in short, the method finds the stationary point of a quadratic form. In
practice, functions are usually not quadratic forms, but a smooth
function is well described by a quadratic form in the neighbourhood of a
minimum. However, the Newton method can converge to a maximum as well as
a minimum, since it may step in the direction of increasing function
values.
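The one-step property on a quadratic form can be checked directly. The following is a minimal pure-Python sketch (the 2x2 matrices and the use of Cramer's rule are illustrative choices, not part of any library):

```python
# Newton step on a positive definite quadratic form f(x) = 0.5 x^T A x - b^T x.
# The gradient is A x - b and the Hessian is A, so the Newton step
# x_new = x - A^{-1}(A x - b) = A^{-1} b lands on the minimum in one step.

def newton_step_2d(A, b, x):
    """One Newton step for the quadratic form above (2x2 case)."""
    # gradient at x
    g = [A[0][0]*x[0] + A[0][1]*x[1] - b[0],
         A[1][0]*x[0] + A[1][1]*x[1] - b[1]]
    # solve A d = g with Cramer's rule
    det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    d = [(g[0]*A[1][1] - A[0][1]*g[1]) / det,
         (A[0][0]*g[1] - A[1][0]*g[0]) / det]
    return [x[0] - d[0], x[1] - d[1]]

A = [[2.0, 0.5], [0.5, 1.0]]   # positive definite Hessian
b = [1.0, 1.0]
x = newton_step_2d(A, b, [10.0, -7.0])   # arbitrary starting point
# the gradient A x - b vanishes at x: the minimum was reached in one step
```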
Quasi-Newton methods solve this problem as follows: they use a positive
definite approximation instead of the Hessian. If the Hessian is
positive definite, a Newton step is taken. If it is indefinite, it is
first modified to make it positive definite, and the step is taken using
the modified matrix. The step is thus always taken in a direction in
which the function decreases. When the Hessian is positive definite, it
provides a quadratic model of the surface, which should improve
convergence; when it is indefinite, the method simply moves in a
direction of decrease.
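One common way to make an indefinite matrix positive definite is to shift it by a multiple of the identity. The sketch below (a hypothetical illustration, using a diagonal Hessian so that positive definiteness is just a sign check, not how any particular library does it) shows that the resulting step is always a descent direction:

```python
# Sketch: if the Hessian is indefinite, shift it by lam*I until it is
# positive definite, then take the (quasi-)Newton step d = (H + lam*I)^{-1} g.
# For a positive definite matrix, g . d > 0, so -d is a descent direction.

def modified_newton_step(h_diag, g, lam_init=1.0):
    """Newton step with a diagonal Hessian; shift by lam*I if any entry <= 0."""
    lam = 0.0
    if min(h_diag) <= 0.0:
        lam = lam_init
        while min(h + lam for h in h_diag) <= 0.0:
            lam *= 2.0
    return [gi / (hi + lam) for gi, hi in zip(g, h_diag)]

g = [3.0, -2.0]
d = modified_newton_step([4.0, -1.0], g)   # indefinite Hessian: entries 4, -1
descent = g[0]*d[0] + g[1]*d[1]            # positive => stepping along -d decreases f
```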
It was said above that the method performs a Newton step. Strictly
speaking, this only defines the direction along which the step is taken.
Some quasi-Newton variants perform an exact line search for the minimum
along this direction, but it has been proved that a sufficient decrease
of the function value is enough: an exact one-dimensional minimum is not
required. The L-BFGS algorithm first tries the full Newton-like step; if
this does not decrease the function value, it shortens the step until a
smaller value is found.
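The step-shortening idea can be sketched as a crude backtracking line search (an illustration only; practical implementations test a sufficient-decrease condition such as the Armijo rule rather than a plain decrease):

```python
# Try the full step along direction d, and halve it until the function
# value actually decreases.

def backtrack(f, x, d, max_halvings=30):
    """Shrink the step along direction d until f decreases."""
    fx = f(x)
    t = 1.0
    for _ in range(max_halvings):
        x_new = [xi + t*di for xi, di in zip(x, d)]
        if f(x_new) < fx:
            return x_new, t
        t *= 0.5
    return x, 0.0   # no decrease found

f = lambda x: (x[0] - 1.0)**2 + 100.0*(x[1] - 2.0)**2
x_new, t = backtrack(f, [0.0, 0.0], [4.0, 8.0])   # overlong trial step
# the full step overshoots; a quarter of it reaches the minimizer (1, 2)
```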
L-BFGS Hessian update scheme
The Hessian of a function is not always available; often we can only
calculate the function gradient. Therefore the following approach is
used: an approximation of the Hessian is built from N consecutive
gradient evaluations, and a quasi-Newton step is performed with it. A
special formula updates this approximation iteratively, and the matrix
remains positive definite at each step. The algorithm uses the BFGS
update scheme, named after Broyden, Fletcher, Goldfarb and Shanno (more
precisely, the scheme maintains not the Hessian but its inverse, so no
time is wasted inverting a Hessian).
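The standard BFGS update of the inverse Hessian H, written out for small dense matrices (a sketch in pure Python, not library code): with s = x_{k+1} - x_k and y = grad_{k+1} - grad_k,

H+ = (I - rho·s·yᵀ) H (I - rho·y·sᵀ) + rho·s·sᵀ, where rho = 1/(yᵀs).

If yᵀs > 0, positive definiteness of H is preserved, and the updated matrix satisfies the secant equation H+·y = s:

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k]*B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def bfgs_inverse_update(H, s, y):
    """BFGS update of the inverse Hessian approximation H."""
    n = len(s)
    rho = 1.0 / sum(yi*si for yi, si in zip(y, s))
    I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    L = [[I[i][j] - rho*s[i]*y[j] for j in range(n)] for i in range(n)]  # I - rho s y^T
    R = [[I[i][j] - rho*y[i]*s[j] for j in range(n)] for i in range(n)]  # I - rho y s^T
    H_new = mat_mul(mat_mul(L, H), R)
    return [[H_new[i][j] + rho*s[i]*s[j] for j in range(n)] for i in range(n)]

H = [[1.0, 0.0], [0.0, 1.0]]      # start from the identity
s, y = [1.0, 0.5], [2.0, 1.5]     # y^T s = 2.75 > 0
H1 = bfgs_inverse_update(H, s, y)
# secant check: H1 applied to y gives back s
```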
The letter L in the scheme name stands for "limited memory". For
problems of high dimension, the O(N²) memory required to store a Hessian
is too big, as is the machine time required to process it. Therefore,
instead of using N gradient values to build the Hessian, we can use a
smaller number M of values, which requires memory of order N·M. In
practice M is usually chosen between 3 and 7; in difficult cases it is
reasonable to increase it to 20. Of course, the result is no longer the
Hessian but only an approximation of it. On the one hand, convergence
slows down; on the other hand, overall performance may actually improve.
At first sight this statement is paradoxical, but it contains no
contradiction: convergence is measured by the number of iterations,
whereas performance depends on the processor time spent to compute the
result.
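The limited-memory variant never forms the matrix at all: it applies the implicit inverse Hessian, defined by the M most recent (s, y) pairs, directly to the gradient via the classic two-loop recursion. A pure-Python sketch (the initial scaling H0 = gamma·I is a common choice, assumed here, not mandated by the text):

```python
def lbfgs_direction(g, pairs):
    """Two-loop recursion: apply the implicit inverse Hessian to gradient g.
    pairs is a list of the M most recent (s, y) tuples, oldest first."""
    q = list(g)
    stack = []
    for s, y in reversed(pairs):             # first loop: newest to oldest
        rho = 1.0 / sum(yi*si for yi, si in zip(y, s))
        alpha = rho * sum(si*qi for si, qi in zip(s, q))
        q = [qi - alpha*yi for qi, yi in zip(q, y)]
        stack.append((rho, alpha, s, y))
    # initial scaling H0 = gamma * I
    s_last, y_last = pairs[-1]
    gamma = (sum(si*yi for si, yi in zip(s_last, y_last))
             / sum(yi*yi for yi in y_last))
    r = [gamma*qi for qi in q]
    for rho, alpha, s, y in reversed(stack):  # second loop: oldest to newest
        beta = rho * sum(yi*ri for yi, ri in zip(y, r))
        r = [ri + (alpha - beta)*si for ri, si in zip(r, s)]
    return r   # the step direction is -r

# with a single stored pair, the secant equation H y = s holds exactly,
# so applying the recursion to g = y must return s
s, y = [1.0, 0.0], [2.0, 1.0]
r = lbfgs_direction(y, [(s, y)])
```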
In fact, the method was originally designed to optimize functions of
many arguments (hundreds and thousands), where accepting a larger
iteration count in exchange for a cheaper, lower-precision approximation
pays off, because the per-iteration overhead becomes much lower. But the
method can be used for small problems too. Its main advantage is
scalability: it delivers high performance on high-dimensional problems
while remaining usable on small ones.
Difference scheme and analytical gradient
Do not calculate the function gradient with a two-point difference
formula: it is insufficiently precise. In a number of cases the
algorithm will be unable to proceed and will return an error message.
Use at least a four-point formula, or the analytical form of the
gradient.
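The precision gap is easy to demonstrate: the two-point central difference has O(h²) error, while the four-point formula has O(h⁴). A quick comparison on f(x) = eˣ at x = 1 (an illustrative sketch, not ALGLIB's internal scheme):

```python
import math

def grad_2pt(f, x, h=1e-3):
    # two-point central difference, error O(h^2)
    return (f(x + h) - f(x - h)) / (2.0*h)

def grad_4pt(f, x, h=1e-3):
    # four-point central difference, error O(h^4)
    return (f(x - 2*h) - 8.0*f(x - h) + 8.0*f(x + h) - f(x + 2*h)) / (12.0*h)

f, x = math.exp, 1.0          # exact derivative is e
err2 = abs(grad_2pt(f, x) - math.e)
err4 = abs(grad_4pt(f, x) - math.e)
# err4 is several orders of magnitude smaller than err2
```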
Use of Algorithm and Reverse Communication
The optimization algorithm must obtain function/gradient values during
its operation. Most program packages solve this problem by passing a
pointer to a function (C++, Delphi) or a delegate (C#) which calculates
the function value/gradient/Hessian.
The ALGLIB package, unlike other libraries, uses reverse communication
instead. Whenever a value or gradient of the function needs to be
calculated, the algorithm state is saved in a special structure and
control is returned to the calling program, which performs the
calculation and calls the computing subroutine again.
Thus, the optimization algorithm operates in the following order:
- Prepare the State data structure by calling the initialization subroutine MinLBFGS.
- Call the MinLBFGSIteration subroutine.
- If it returns False, the algorithm has finished and the minimum has been found (the minimum itself can be obtained by calling the MinLBFGSResults subroutine).
- If it returns True, the subroutine is requesting information about the function. The function/gradient must be calculated (this is detailed below).
- After the requested information has been loaded into the State structure, MinLBFGSIteration must be called again.
The following fields of the State structure are used to exchange information with the user:
- State.X[0..N-1] – an array containing the coordinates of the point x
- State.F – the value of the function F(x) should be stored in this field
- State.G[0..N-1] – the gradient grad F(x) should be stored in this field
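The overall shape of such a driver loop can be illustrated with a toy example (hypothetical names and a trivial 1-D step-shrinking "algorithm" standing in for L-BFGS; this is not the actual ALGLIB API). The point is that the algorithm never calls user code: it returns control whenever it needs f(x), and the caller fills in State.F before resuming:

```python
class State:
    def __init__(self, x):
        self.x = x        # point where f must be evaluated
        self.f = None     # the caller stores f(self.x) here
        self.best = None  # best (x, f) seen so far
        self.step = 1.0
class_ = State            # alias only to keep the sketch lint-quiet

def iteration(st):
    """Return True when f(st.x) is requested; False when optimization is done."""
    if st.f is None:
        return True                      # request f(st.x) from the caller
    if st.best is None or st.f < st.best[1]:
        st.best = (st.x, st.f)
        st.x = st.x - st.step            # keep moving while f decreases
    else:
        st.step *= 0.5                   # shrink the step, retry from best
        st.x = st.best[0] - st.step
    st.f = None                          # next True return asks for f(st.x)
    return st.step >= 1e-6

# driver loop: the shape that users of a reverse-communication API write
st = State(5.0)
while iteration(st):
    st.f = (st.x - 1.0)**2               # user-side evaluation of f
x_min = st.best[0]                       # minimizer of (x - 1)^2
```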
Subroutines Description
Unit lbfgs
— Data structure LBFGSState
This structure stores the current state of the optimization algorithm between calls of the MinLBFGSIteration subroutine.
— Data structure LBFGSReport
This structure stores the optimization report: number of iterations IterationsCount, number of function/gradient evaluations NFEV, and completion code TerminationType.
— Subroutine MinLBFGS(N, M, X, EpsG, EpsF, EpsX, MaxIts, Flags, out State)
This subroutine is used to start optimization. It is called to initialize the State structure before calling MinLBFGSIteration. Subroutine parameters are: problem dimension N, model rank M, stopping conditions EpsG, EpsF, EpsX, MaxIts, and the Flags parameter (see the subroutine comments for more information).
— Subroutine MinLBFGSIteration(var State)
— Subroutine MinLBFGSResults(State, X, Rep)
This subroutine is used to obtain the algorithm results: the minimum found and the optimization report. It can be called only after MinLBFGSIteration has returned False.
Source codes
C#
C# 1.0 source.
lbfgs.csharp.zip - L-BFGS algorithm for multivariate optimization
C++
C++ source.
lbfgs.cpp.zip - L-BFGS algorithm for multivariate optimization
ablas.zip - optimized basic linear algebra subroutines with SSE2 support (for C++ sources only)
C++, multiple precision arithmetic
C++ source. MPFR/GMP is used.
GMP source is available from gmplib.org. MPFR source is available from www.mpfr.org.
lbfgs.mpfr.zip - L-BFGS algorithm for multivariate optimization
mpfr.zip - precompiled Win32 MPFR/GMP binaries
Delphi
Delphi source.
Can be compiled under FPC (in Delphi compatibility mode).
lbfgs.delphi.zip - L-BFGS algorithm for multivariate optimization
Visual Basic
Visual Basic source.
lbfgs.vb6.zip - L-BFGS algorithm for multivariate optimization