# Optimisers: How to improve the current parameter values

The optimiser varies the model parameters in an attempt to find the solution which minimises the chosen statistic.

In general it is expected that the optimiser will be used by a Fit object to perform the fit, but it can also be used directly via its fit() method. The optimiser object exposes configuration values that can be changed to tweak its behaviour: for instance, the tolerance used to determine whether the fit has converged, the maximum number of iterations to use, or how much information to display whilst optimising a model.

As an example, the default parameter values for the Levenberg-Marquardt optimiser are:

>>> from sherpa.optmethods import LevMar
>>> lm = LevMar()
>>> print(lm)
name    = levmar
ftol    = 1.19209289551e-07
xtol    = 1.19209289551e-07
gtol    = 1.19209289551e-07
maxfev  = None
epsfcn  = 1.19209289551e-07
factor  = 100.0
verbose = 0


These settings are available both as fields of the object and via the config dictionary field.
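This dual access pattern - the same setting reachable as an attribute and through a `config` dictionary - can be mimicked with a small sketch. This is illustrative only, not Sherpa's actual implementation; the class name and defaults are made up here:

```python
class OptMethod:
    """Toy optimiser-settings holder: attributes are backed by a config dict."""

    def __init__(self, name, **defaults):
        # bypass __setattr__ while bootstrapping the two real attributes
        self.__dict__['name'] = name
        self.__dict__['config'] = dict(defaults)

    def __getattr__(self, key):
        # called only when normal attribute lookup fails: fall back to config
        try:
            return self.config[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        # settings go into the config dict so both views stay in sync
        if key in self.config:
            self.config[key] = value
        else:
            self.__dict__[key] = value


lm = OptMethod('levmar', ftol=1.19e-07, maxfev=None)
lm.ftol = 1e-06            # attribute write updates the config dict too
```

Updating either view is then visible through the other, which is the behaviour the text above describes for the real optimiser objects.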

Additional optimisers can be built by extending from the sherpa.optmethods.OptMethod class. This can be used to provide access to external packages such as CERN’s MINUIT optimisation library.

## Choosing an optimiser

> **Warning:** The following may not correctly represent Sherpa's current capabilities, so please take care when interpreting this section.

The following information is adapted from a memo written by Mark Birkinshaw (1998).

The minimization of mathematical functions is a difficult operation. A general function $$f({\bf x})$$ of the vector argument $$\bf x$$ may have many isolated local minima, non-isolated minimum hypersurfaces, or even more complicated topologies. No finite minimization routine can guarantee to locate the unique, global, minimum of $$f({\bf x})$$ without being fed intimate knowledge about the function by the user.

This does not mean that minimization is a hopeless task. For many problems there are techniques which will locate a local minimum which may be “close enough” to the global minimum, and there are techniques which will find the global minimum a large fraction of the time (in a probabilistic sense). However, the reader should be aware of my philosophy: there is no “best” algorithm for finding the minimum of a general function. Instead, Sherpa provides tools which will allow the user to look at the overall behavior of the function and find plausible local minima, which will often contain the physically-meaningful minimum in the types of problem with which Sherpa deals.

In general, the best assurance that the correct minimum has been found in a particular calculation is careful examination of the nature of the solution (e.g., by plotting a fitted function over data), and some confidence that the full region that the minimum may lie in has been well searched by the algorithm used. This document seeks to give the reader some information about what the different choices of algorithm will mean in terms of run-time and confidence of locating a good minimum.

Some points to take away from the discussions in the rest of this document.

1. Never accept the result of a minimization using a single optimization run; always test the minimum using a different method.

2. Check that the result of the minimization does not have parameter values at the edges of the parameter space. If this happens, then the fit must be disregarded since the minimum lies outside the space that has been searched, or the minimization missed the minimum.

3. Get a feel for the range of values of the target function (in Sherpa this is the fit statistic), and the stability of the solution, by starting the minimization from several different parameter values.

4. Always check that the minimum “looks right” by visualizing the model and the data.
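Points 1 and 3 can be illustrated with a toy example. The double-well statistic and both search routines below are hypothetical and written in plain Python, not Sherpa code; they show how a coarse scan plus a refiner finds the global minimum, while a refiner started from a bad guess gets stuck:

```python
def stat(x):
    # toy fit statistic: double well with the global minimum near x = -1
    # and a shallower local minimum near x = +1
    return (x * x - 1.0) ** 2 + 0.2 * x

def grid_min(f, lo, hi, n=601):
    # "scatter-shot": coarse uniform scan over the whole allowed range
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(xs, key=f)

def refine(f, x0, step=0.1, tol=1e-8):
    # "single-shot": move downhill, halving the step when no neighbour improves
    while step > tol:
        for trial in (x0 - step, x0 + step):
            if f(trial) < f(x0):
                x0 = trial
                break
        else:
            step /= 2.0
    return x0

x_grid = grid_min(stat, -3.0, 3.0)   # locates the basin of the global minimum
x_best = refine(stat, x_grid)        # refines it to high precision
x_stuck = refine(stat, 1.2)          # a bad start finds only the local minimum
```

Comparing `x_best` and `x_stuck` is exactly the kind of cross-check point 1 recommends: two runs that disagree signal that at least one of them has not found the global minimum.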

Sherpa contains two types of routine for minimizing a fit statistic. I will call them the “single-shot” routines, which start from a guessed set of parameters, and then try to improve the parameters in a continuous fashion, and the “scatter-shot” routines, which try to look at parameters over the entire permitted hypervolume to see if there are better minima than near the starting guessed set of parameters.

### Single-shot techniques

As the reader might expect, the single-shot routines are relatively quick, but depend critically on the guessed initial parameter values $${\bf x}_0$$ being near (in some sense) to the minimum $${\bf x}_{\rm min}$$. All the single-shot routines investigate the local behaviour of the function near $${\bf x}_0$$, and then make a guess at the best direction and distance to move to find a better minimum. After testing at the new point, they accept that point as the next guess, $${\bf x}_1$$, if the fit statistic is smaller than at the first point, and modify the search procedure if it isn’t smaller. The routines continue to run until one of the following occurs:

1. all search directions result in an increased value of the fit statistic;

2. an excessive number of steps have been taken; or

3. something strange happens to the fit statistic (e.g., it turns out to be discontinuous in some horrible way).

This description indicates that for the single-shot routines, there is a considerable emphasis on the initial search position, $${\bf x}_0$$, being reasonable. It may also be apparent that the values of these parameters should be moderate; neither too small ($$10^{-12}$$, say), nor too large ($$10^{12}$$, say). This is because the initial choice of step size in moving from $${\bf x}_0$$ towards the next improved set of parameters, $${\bf x}_1$$, is based on the change in the fit statistic, $$f({\bf x})$$ as components of $${\bf x}$$ are varied by amounts $${\cal O}(1)$$. If $$f$$ varies little as $${\bf x}$$ is varied by this amount, then the calculation of the distance to move to reach the next root may be inaccurate. On the other hand, if $$f$$ has a lot of structure (several maxima and minima) as $${\bf x}$$ is varied by the initial step size, then these single-shot minimizers may mistakenly jump entirely over the “interesting” region of parameter space.

These considerations suggest that the user should arrange that the search vector is scaled so that the range of parameter space to be searched is neither too large nor too small. To take a concrete example, it would not be a good idea to have $$x_7$$ parameterize the Hydrogen column density ($$N_{\rm H}$$) in a spectral fit, with an initial guess of $$10^{20}\ {\rm cm}^{-2}$$, and a search range (in units of $${\rm cm}^{-2}$$) of $$10^{16}$$ to $$10^{24}$$. The minimizers will look for variations in the fit statistic as $$N_{\rm H}$$ is varied by $$1\ {\rm cm}^{-2}$$, and finding none (to the rounding accuracy likely for the code), will conclude that $$x_7$$ is close to being a null parameter and can be ignored in the fitting. It would be much better to have $$x_7 = \log_{10}(N_{\rm H})$$, with a search range of 16 to 24. Significant variations in the fit statistic will occur as $$x_7$$ is varied by $$\pm 1$$, and the code has a reasonable chance of finding a useful solution.
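The scaling problem can be demonstrated numerically. The statistic below is a made-up stand-in for a real fit statistic, with its minimum placed at $$N_{\rm H} = 10^{21}\ {\rm cm}^{-2}$$:

```python
import math

def stat_raw(nh):
    # toy statistic in raw column-density units, minimum at N_H = 1e21
    return (math.log10(nh) - 21.0) ** 2

def stat_log(x7):
    # the same statistic with x7 = log10(N_H), minimum at x7 = 21
    return (x7 - 21.0) ** 2

# a unit step in the raw parameterisation is invisible: 1e20 + 1.0 rounds
# back to 1e20 in double precision, so the statistic does not change at all
delta_raw = abs(stat_raw(1e20 + 1.0) - stat_raw(1e20))

# a unit step in log10 space changes the statistic by O(1)
delta_log = abs(stat_log(20.0 + 1.0) - stat_log(20.0))
```

In the raw parameterisation the unit probe step is swallowed by floating-point rounding, which is precisely the "null parameter" failure mode described above; in log space the same probe produces a usable signal.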

Bearing this in mind, the single-shot minimizers in Sherpa are listed below:

NelderMead

This technique - also known as Simplex - creates a polyhedral search element around the initial position, $${\bf x}_0$$, and then grows or shrinks in particular directions while crawling around parameter space, to try to place a minimum within the final search polyhedron. This technique has some hilarious ways of getting stuck in high-dimension parameter spaces (where the polyhedron can become a strange shape), but is very good at finding minima in regions where the fit statistic has a moderately well-defined topology. Since it works in a different way from Levenberg-Marquardt minimization, a good strategy is to combine both minimizations to test whether an apparent minimum found by one technique is stable when searched by the other. I regard NelderMead searching as good in smooth and simple parameter spaces, particularly when looking at regions where the fit statistic depends on a parameter in a linear or parabolic fashion, and bad where surfaces of equal value of the fit statistic are complicated. In either case, it is essential that the initial size of the polyhedron (with sides of length 1 unit) is a smallish fraction of the search space.
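The core simplex moves (reflect, expand, contract, shrink) can be sketched in plain Python. This is a minimal illustration of the algorithm, not Sherpa's NelderMead implementation:

```python
def nelder_mead(f, x0, step=0.1, tol=1e-8, max_iter=500):
    """Minimal Nelder-Mead sketch: f takes a list of parameters."""
    n = len(x0)
    # initial simplex: x0 plus one point offset along each axis
    simplex = [list(x0)]
    for i in range(n):
        p = list(x0)
        p[i] += step
        simplex.append(p)
    for _ in range(max_iter):
        simplex.sort(key=f)                    # best first, worst last
        best, worst = simplex[0], simplex[-1]
        if abs(f(worst) - f(best)) < tol:      # simplex has collapsed
            break
        # centroid of all vertices except the worst
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        refl = [2.0 * centroid[i] - worst[i] for i in range(n)]
        if f(refl) < f(best):
            # reflection is the new best point: try expanding further
            expd = [3.0 * centroid[i] - 2.0 * worst[i] for i in range(n)]
            simplex[-1] = expd if f(expd) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl                 # accept the reflection
        else:
            # contract toward the centroid; if even that fails, shrink
            contr = [0.5 * (centroid[i] + worst[i]) for i in range(n)]
            if f(contr) < f(worst):
                simplex[-1] = contr
            else:
                simplex = [best] + [[0.5 * (p[i] + best[i]) for i in range(n)]
                                    for p in simplex[1:]]
    simplex.sort(key=f)
    return simplex[0]


quad = lambda p: (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2
res = nelder_mead(quad, [0.0, 0.0], step=0.5)
```

On a smooth parabolic statistic like `quad` the simplex homes in on the minimum quickly, which matches the "good in smooth and simple parameter spaces" assessment above.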

Levenberg-Marquardt

This can be considered to be a censored maximum-gradients technique which, starting from a first guess, moves towards a minimum by finding a good direction in which to move, and calculating a sensible distance to go. Its principal drawback is that to calculate the distance to move it has to make some assumptions about how large a step size to take, and hence there is an implicit assumption that the search space is reasonably well scaled (to $$\pm 10$$ units in each of the search directions, say). It is also important that in finding these gradients, the steps do not miss a lot of important structure; i.e. there should not be too many subsidiary minima. The search directions and distances to move are based on the shape of the target function near the initial guessed minimum, $${\bf x}_0$$, with progressive movement towards the dominant local minimum. Since this technique uses information about the local curvature of the fit statistic as well as its local gradients, the approach tends to stabilize the result in some cases. I regard the techniques implemented in Sherpa as being good minimum-refiners for simple local topologies, since more assumptions about topology are made than in the NelderMead approach, but bad at finding global minima for target functions with complicated topologies.
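A minimal damped least-squares iteration of this kind, for a two-parameter model with the normal equations solved by hand, might look like the following sketch. This is illustrative only, not Sherpa's LevMar code; the function names and the simple accept/reject damping schedule are assumptions:

```python
def levmar(residuals, jac, p0, lam=1e-3, iters=50):
    """Toy Levenberg-Marquardt for two parameters.

    residuals(p) -> list of r_i; jac(p) -> rows [dr_i/dp0, dr_i/dp1].
    """
    p = list(p0)
    cost = sum(r * r for r in residuals(p))
    for _ in range(iters):
        J, r = jac(p), residuals(p)
        # 2x2 damped normal equations: (J^T J + lam * diag(J^T J)) d = -J^T r
        a00 = sum(row[0] * row[0] for row in J)
        a01 = sum(row[0] * row[1] for row in J)
        a11 = sum(row[1] * row[1] for row in J)
        g0 = sum(row[0] * ri for row, ri in zip(J, r))
        g1 = sum(row[1] * ri for row, ri in zip(J, r))
        A00, A11 = a00 * (1.0 + lam), a11 * (1.0 + lam)
        det = A00 * A11 - a01 * a01
        d0 = (-g0 * A11 + g1 * a01) / det
        d1 = (-g1 * A00 + g0 * a01) / det
        trial = [p[0] + d0, p[1] + d1]
        new_cost = sum(ri * ri for ri in residuals(trial))
        if new_cost < cost:
            p, cost, lam = trial, new_cost, lam * 0.5   # accept; trust model more
        else:
            lam *= 10.0                                  # reject; damp harder
    return p


# fit y = a*x + b to exact data generated with a = 2, b = 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
res_fn = lambda p: [p[0] * x + p[1] - y for x, y in zip(xs, ys)]
jac_fn = lambda p: [[x, 1.0] for x in xs]
fit = levmar(res_fn, jac_fn, [0.0, 0.0])
```

The damping parameter `lam` interpolates between a gradient-descent-like step (large `lam`) and a full curvature-based Gauss-Newton step (small `lam`), which is the mix of gradient and curvature information described above.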

### Scatter-shot techniques

Although a bit ad hoc, these techniques attempt to locate a decent minimum over the entire range of the search parameter space. Because they involve searching a lot of the parameter space, they involve many function evaluations, and are somewhere between quite slow and incredibly-tediously slow.

The routines are listed below:

GridSearch

This routine simply searches a grid in each of the search parameters, where the spacing is uniform between the minimum and maximum value of each parameter. There is an option to refine the fit at each point, by setting the method attribute to one of the single-shot optimisers, but this is not set by default, as it can significantly increase the time required to fit the data. The coarseness of the grid sets how precise a root will be found, and if the fit statistic has significant structure on a smaller scale, then the grid-searcher will miss it completely. This is a good technique for finding an approximation to the minimum for a slowly-varying function. It is a bad technique for getting accurate estimates of the location of a minimum, or for examining a fit statistic with lots of subsidiary maxima and minima within the search space. It is intended for use with template models.
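A bare-bones version of such a uniform grid scan (illustrative only, not Sherpa's GridSearch) can be written as:

```python
from itertools import product

def grid_search(f, ranges, steps=10):
    """Evaluate f on a uniform grid over each (lo, hi) range; return the best point."""
    axes = [[lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
            for lo, hi in ranges]
    # exhaustively evaluate every grid point: steps ** len(ranges) calls to f
    return min(product(*axes), key=lambda p: f(list(p)))


bowl = lambda p: (p[0] - 0.5) ** 2 + (p[1] + 0.5) ** 2
best = grid_search(bowl, [(-1.0, 1.0), (-1.0, 1.0)], steps=21)
```

The cost of `steps ** len(ranges)` evaluations makes the scaling with dimensionality obvious, and the grid spacing directly limits the precision of the result, as the text above notes.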

Monte Carlo

This is a simple population based, stochastic function minimizer. At each iteration it combines population vectors - each containing a set of parameter values - using a weighted difference. This optimiser can be used to find solutions to complex search spaces but is not guaranteed to find a global minimum. It is over-kill for relatively simple problems.
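The "weighted difference" scheme of population vectors described here is essentially differential evolution. A minimal sketch of that idea (not Sherpa's actual Monte Carlo implementation; the control parameters `F` and `CR` are the standard differential-evolution names) is:

```python
import random

def diff_evolution(f, bounds, pop_size=20, F=0.8, CR=0.9, gens=200, seed=1):
    """Toy differential evolution: minimise f over box bounds [(lo, hi), ...]."""
    random.seed(seed)
    n = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [f(p) for p in pop]
    for _ in range(gens):
        for i in range(pop_size):
            # pick three distinct other members and form a weighted difference
            a, b, c = random.sample([j for j in range(pop_size) if j != i], 3)
            jrand = random.randrange(n)   # ensure at least one mutated component
            trial = []
            for j in range(n):
                if random.random() < CR or j == jrand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                else:
                    v = pop[i][j]
                lo, hi = bounds[j]
                trial.append(min(max(v, lo), hi))   # clip to the search box
            tc = f(trial)
            if tc < costs[i]:            # greedy replacement
                pop[i], costs[i] = trial, tc
    best = min(range(pop_size), key=lambda i: costs[i])
    return pop[best], costs[best]


sphere = lambda p: p[0] ** 2 + p[1] ** 2
best_p, best_cost = diff_evolution(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```

The inner loop makes `pop_size` function evaluations per generation, which is where the "quite slow to incredibly-tediously slow" cost of the scatter-shot methods comes from.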

Overall, the single-shot methods are best regarded as ways of refining minima located in other ways: from good starting guesses, or from the scatter-shot techniques. Using intelligence to come up with a good first-guess solution is the best approach, after which the single-shot refiners can be used to get accurate values for the parameters at the minimum. However, I would certainly recommend running at least a second single-shot minimizer after the first, to get some indication that one set of assumptions about the shape of the minimum is not compromising the solution. It is probably best if the code rescales the parameter range between minimizations, so that a completely different sampling of the function near the trial minimum is being made.

| Optimiser | Type | Speed | Commentary |
| --- | --- | --- | --- |
| NelderMead | single-shot | fast | OK for refining minima |
| Levenberg-Marquardt | single-shot | fast | OK for refining minima; should only be used with chi-square statistics |
| GridSearch | scatter-shot | slow | OK for smooth functions |
| Monte Carlo | scatter-shot | very slow | Good in many cases |