Thursday, December 28, 2017

A Global Optimization Algorithm Worth Using

Here is a common problem: you have some machine learning algorithm you want to use but it has these damn hyperparameters. These are numbers like weight decay magnitude, Gaussian kernel width, and so forth. The algorithm doesn't set them, instead, it's up to you to determine their values. If you don't set these parameters to "good" values the algorithm doesn't work. So what do you do? Well, here is a list of everything I've seen people do, listed in order of most to least common:
  • Guess and Check: Listen to your gut, pick numbers that feel good and see if they work. Keep doing this until you are tired of doing it.
  • Grid Search: Ask your computer to try a bunch of values spread evenly over some range.
  • Random Search: Ask your computer to try a bunch of values by picking them randomly.
  • Bayesian Optimization: Use a tool like MATLAB's bayesopt to automatically pick the best parameters, then find out Bayesian Optimization has more hyperparameters than your machine learning algorithm, get frustrated, and go back to using guess and check or grid search.
  • Local Optimization With a Good Initial Guess: This is what MITIE does, it uses the BOBYQA algorithm with a well chosen starting point. Since BOBYQA only finds the nearest local optima the success of this method is heavily dependent on a good starting point. In MITIE's case we know a good starting point, but this isn't a general solution since usually you won't know a good starting point. On the plus side, this kind of method is extremely good at finding a local optima. I'll have more to say on this later.
The vast majority of people just do guess and check. That sucks and there should be something better. We all want some black-box optimization strategy like Bayesian optimization to be useful, but in my experience, if you don't set its hyperparameters to the right values it doesn't work as well as an expert doing guess and check. Everyone I know who has used Bayesian optimization has had the same experience. Ultimately, if I think I can do better hyperparameter selection manually then that's what I'm going to do, and most of my colleagues feel the same way. The end result is that I don't use automated hyperparameter selection tools most of the time, and that bums me out. I badly want a parameter-free global optimizer that I can trust to do hyperparameter selection.

So I was very excited when I encountered the paper Global optimization of Lipschitz functions by Cédric Malherbe and Nicolas Vayatis in this year's international conference on machine learning. In this paper, they propose a very simple parameter-free and provably correct method for finding the $x \in \mathbb{R}^d$ that maximizes a function, $f(x)$, even if $f(x)$ has many local maxima. The key idea in their paper is to maintain a piecewise linear upper bound of $f(x)$ and use that to decide which $x$ to evaluate at each step of the optimization. So if you already evaluated the points $x_1, x_2, \cdots, x_t$ then you can define a simple upper bound on $f(x)$ like this:
\[ \newcommand{\norm}[1]{\left\lVert#1\right\rVert} U(x) = \min_{i=1\dots t} (f(x_i) + k \cdot \norm{x-x_i}_2 ) \] Where $k$ is the Lipschitz constant for $f(x)$. Therefore, it is trivially true that $U(x) \geq f(x), \forall x$, by the definition of the Lipschitz constant. The authors go on to suggest a simple algorithm, called LIPO, that picks points at random, checks if the upper bound for the new point is better than the best point seen so far, and if so selects it as the next point to evaluate. For example, the figure below shows a plot of a simple $f(x)$ in red with a plot of its associated upper bound $U(x)$ in green. In this case $U(x)$ is defined by 4 points, indicated here with little black squares.



It shouldn't take a lot of imagination to see how the upper bound helps you pick good points to evaluate. For instance, if you selected the max upper bound as the next iterate you would already get pretty close to the global maximizer. The authors go on to prove a bunch of nice properties of this method. In particular, they both prove mathematically and show empirically that the method is better than random search in a number of non-trivial situations. This is a fairly strong statement considering how competitive random hyperparameter search turns out to be relative to competing hyperparameter optimization methods. They also compare the method to other algorithms like Bayesian optimization and show that it's competitive.

But you are probably thinking: "Hold on a second, we don't know the value of the Lipschitz constant $k$!". This isn't a big deal since it's easily estimated, for instance, by setting $k$ to the largest observed slope of $f(x)$ before each iteration. That's equivalent to solving the following easy problem:
\begin{align}
\min_{k} & \quad k^2 \\
\text{s.t.} & \quad U(x_i) \geq f(x_i), \quad \forall i \in [1\dots t] \\
& \quad k \geq 0
\end{align} Malherbe et al. test a variant of this $k$ estimation approach and show it works well.

This is great. I love this paper. It's proposing a global optimization method called LIPO that is both parameter free and provably better than random search. It's also really simple. Reading this paper gives you one of those "duah" moments where you wonder why you didn't think of this a long time ago. That's the mark of a great paper. So obviously I was going to add some kind of LIPO algorithm to dlib, which I did in the recent dlib v19.8 release.

However, if you want to use LIPO in practice there are some issues that need to be addressed. The rest of this blog post discusses these issues and how the dlib implementation addresses them. First, if $f(x)$ is noisy or discontinuous even a little it's not going to work reliably since $k$ will be infinity. This happens in real world situations all the time. For instance, evaluating a binary classifier against the 0-1 loss gives you an objective function with little discontinuities anywhere samples switch their predicted class. You could cross your fingers and run LIPO anyway, but you run the very real risk of two $x$ samples closely straddling a discontinuity and causing the estimated $k$ to explode. Second, not all hyperparameters are equally important, some hardly matter while small changes in others drastically affect the output of $f(x)$. So it would be nice if each hyperparameter got its own $k$. You can address these problems by defining the upper bound $U(x)$ as follows:
\[ U(x) = \min_{i=1\dots t} \left[ f(x_i) + \sqrt{\sigma_i +(x-x_i)^\intercal K (x-x_i)} \ \right] \] Now each sample from $f(x)$ has its own noise term, $\sigma_i$, which should be 0 most of the time unless $x_i$ is really close to a discontinuity or there is some stochasticity. Here, $K$ is a diagonal matrix that contains our "per hyperparameter Lipschitz $k$ terms". With this formulation, setting each $\sigma$ to 0 and $K=k^2I$ gives the same $U(x)$ as suggested by Malherbe et al., but if we let them take more general values we can deal with the above mentioned problems.

Just like before, we can find the parameters of $U(x)$ by solving an optimization problem:
\begin{align}
\min_{K,\sigma} & \quad \norm{K}^2_F + 10^6 \sum_{i=1}^t {\sigma_i^2} &\\
\text{s.t.} & \quad U(x_i) \geq f(x_i), & \quad \forall i \in [1\dots t] \\
& \quad \sigma_i \geq 0 & \quad \forall i \in [1\dots t] \\
& \quad K_{i,j} \geq 0 & \quad \forall i,j \in [1\dots d] \\
& \quad \text{K is a diagonal matrix}
\end{align} The $10^6$ penalty on $\sigma^2$ causes most $\sigma$ terms to be exactly 0. The behavior of the whole algorithm is insensitive to the particular penalty value used here, so long as it's reasonably large the $\sigma$ values will be 0 most of the time while still preventing $k$ from becoming infinite, which is the behavior we want. It's also possible to rewrite this as a big quadratic programming problem and solve it with a dual coordinate descent method. I'm not going into the details here. It's all in the dlib code for those really interested. The TL;DR is that it turns out to be easy to solve using well known methods and it fixes the infinite $k$ problem.

The final issue that needs to be addressed is LIPO's terrible convergence in the area of a local maximizer. So while it's true that LIPO is great at getting onto the tallest peak of $f(x)$, once you are there it does not make very rapid progress towards the optimal location (i.e. the very top of the peak). This is a problem shared by many derivative free optimization algorithms, including MATLAB's Bayesian optimization tool. Fortunately, not all methods have this limitation. In particular, the late and great Michael J. D. Powell wrote a series of papers on how to apply classic trust region methods to derivative free optimization. These methods fit a quadratic surface around the best point seen so far and then take the next iterate to be the maximizer of that quadratic surface within some distance of the current best point. So we "trust" this local quadratic model to be accurate within some small region around the best point, hence the name "trust region". The BOBYQA method I mentioned above is one of these methods and it has excellent convergence to the nearest local optima, easily finding local optima to full floating point precision in a very small number of steps.

We can fix LIPO's convergence problem by combining these two methods, LIPO will explore $f(x)$ and quickly find a point on the biggest peak. Then a Powell style trust region method can efficiently find the exact maximizer of that peak. The simplest way to combine these two things is to alternate between them, which is what dlib does. On even iterations we pick the next $x$ according to our upper bound while on odd iterations we pick the next $x$ according to the trust region model. I've also used a slightly different version of LIPO that I'm calling MaxLIPO. Recall that Malherbe et al. suggest selecting any point with an upper bound larger than the current best objective value. However, I've found that selecting the maximum upper bounding point on each iteration is slightly better. This alternative version, MaxLIPO, is therefore what dlib uses. You can see this hybrid of MaxLIPO and a trust region method in action in the following video:


In the video, the red line is the function to be optimized and we are looking for the maximum point. Every time the algorithm samples a point from the function we note it with a little box. The state of the solver is determined by the global upper bound $U(x)$ and the local quadratic model used by the trust region method. Therefore, we draw the upper bounding model as well as the current local quadratic model so you can see how they evolve as the optimization proceeds. We also note the location of the best point seen so far by a little vertical line.

You can see that the optimizer is alternating between picking the maximum upper bounding point and the maximum point according to the quadratic model. As the optimization proceeds, the upper bound becomes progressively more accurate, helping to find the best peak to investigate, while the quadratic model quickly finds a high precision maximizer on whatever peak it currently rests. These two things together allow the optimizer to find the true global maximizer to high precision (within $\pm{10^{-9}}$ in this case) by the time the video concludes.

The Holder Table Test Function
from https://en.wikipedia.org/wiki/File:Holder_table_function.pdf

Now let's do an experiment to see how this hybrid of MaxLIPO and Powell's trust region method (TR) compares to MATLAB's Bayesian optimization tool with its default settings. I ran both algorithms on the Holder table test function 100 times and plotted the average error with one standard deviation error bars. So the plot below shows $f(x^\star)-f(x_i)$, the difference between the true global optimum and the best solution found so far, as a function of the number of calls to $f(x)$. You can see that MATLAB's BayesOpt stalls out at an accuracy of about $\pm{10^{-3}}$ while our hybrid method (MaxLIPO+TR, the new method in dlib) quickly approaches full floating point precision of around $\pm{10^{-17}}$.


I also reran some of the tests from Figure 5 of the LIPO paper. The results are shown in the table below. In these experiments I compared the performance of LIPO with and without the trust region solver (LIPO+TR and LIPO). Additionally, to verify that LIPO is better than pure random search I tested a version of the algorithm that alternates between pure random search and the trust region solver (PRS+TR) rather than alternating between a LIPO method and a trust region solver (LIPO+TR and MaxLIPO+TR). Pure random search (PRS) is also included for reference. Finally, the new algorithm implemented in dlib, MaxLIPO+TR, is included as well. In each test I ran the algorithm 1000 times and recorded the mean and standard deviation of the number of calls to $f(x)$ required to reach a particular solution accuracy. For instance, $\epsilon=0.01$ means that $f(x^\star)-f(x_i) \leq 0.01$, while "target 99%" uses the "target" metric from Malherbe's paper, which for most tests corresponds to an $\epsilon > 0.1$. Tests that took too long to execute are noted with a - symbol.

The key points to notice about these results are that the addition of a trust region method allows LIPO to reach much higher solution accuracy. It also makes the algorithm run faster. Recall that LIPO works internally by using random search of $U(x)$. Therefore, the number of calls LIPO makes to $U(x)$ is at least as many as PRS would require when searching $f(x)$. So for smaller $\epsilon$ it becomes very expensive to execute LIPO. For instance, I wasn't able to get results for LIPO, by itself, at accuracies better than $0.1$ on any of the test problems since it took too long to execute. However, with a trust region method the combined algorithm can easily achieve high precision solutions. The other significant detail is that, for tests with many local optima, all methods combining LIPO with TR are much better than PRS+TR. This is most striking on ComplexHolder, which is a version of the HolderTable test function with additional high frequency sinusoidal noise that significantly increases the number of local optima. On ComplexHolder, LIPO based methods require about an order of magnitude fewer calls to $f(x)$ than PRS+TR, further justifying the claims by Malherbe et al. of the superiority of LIPO relative to pure random search.


The new method in dlib, MaxLIPO+TR, fares the best in all my tests. What is remarkable about this method is its simplicity. In particular, MaxLIPO+TR doesn't have any hyperparameters, making it very easy to use. I've been using it for a while now for hyperparameter optimization and have been very pleased. It's the first black-box hyperparameter optimization algorithm I've had enough confidence in to use on real problems.

Finally, here is an example of how you can use this new optimizer from Python:
def holder_table(x0,x1):
    return -abs(sin(x0)*cos(x1)*exp(abs(1-sqrt(x0*x0+x1*x1)/pi)))

x,y = dlib.find_min_global(holder_table, 
                           [-10,-10],  # Lower bound constraints on x0 and x1 respectively
                           [10,10],    # Upper bound constraints on x0 and x1 respectively
                           80)         # The number of times find_min_global() will call holder_table()

Or in C++11:
auto holder_table = [](double x0, double x1) {return -abs(sin(x0)*cos(x1)*exp(abs(1-sqrt(x0*x0+x1*x1)/pi)));};

// obtain result.x and result.y
auto result = find_min_global(holder_table, 
                             {-10,-10}, // lower bounds
                             {10,10}, // upper bounds
                             max_function_calls(80));

Both of these methods find holder_table's global optima to about 12 digits of precision in about 0.1 seconds. The C++ API exposes a wide range of ways to call the solver, including optimizing multiple functions at a time and adding integer constraints. See the documentation for full details.

51 comments :

blackball said...

The first best read (including the references) in my 2018. Thanks!

Davis King said...

Thanks :)

Piotr said...

Holy Cow, this is amazing. Thanks for sharing it in such an infromative and detailed manner. Thanks for making 2018 cool again.

Davis King said...

No problem :)

Unknown said...

Fantastic work! I enormously appreciate that you take the time to make things work for real, combining multiple methods if necessary.

I'm quite interested in knowing the practical limitations of this approach, you mentioned discontinuities and noise, when applied to the objective functions of neural networks. Would it work for a some NN models - what maximum network size? Is the dimensionality of the function a problem?

Thank you!

Davis King said...

Thanks :)

You definitely can't use this to find the parameters of a NN. I wouldn't attempt to optimize functions with more than 10s of parameters with a derivative free optimizer. Even that is pushing it. You really need to know something about the structure of your objective function if you want to optimize something with large numbers of variables. In particular, you need to know the gradient at least.

Consider for example that a 10000 dimensional objective function requires 10001 evaluations to even estimate the gradient. That's a huge problem for a derivative free optimizer.

Stergios said...

Great read!! Thank you!

One question: Would it be feasible to enforce some restrictions on the parameters e.g. take only integer values or support categorical features? I guess I could apply them inside the objective function (e.g. through rounding and by transforming numeric values to categories), but would this make sense?

涵胡 said...

excellent work.I compare the algrithm with AdamGradient and rms method ,find it is much better optimizer when the effictive value range is narrow.

i test the function 1/(1+exp(100*x))+1/(1+exp(-100*(x-0.05))), and get the resault following:
global optimal inputs: 0.0
global optimal output: 0.5066928509242848
adam optimal inputs: nan
adam optimal output: nan

here is my code:
import tensorflow as tf
import dlib
import math
def test_fun(x0):
return 1/(1+math.exp(100*x0))+1/(1+math.exp(-100*(x0-0.05)))
def test_fun_tf(x0):
return 1/(1+tf.exp(100*x0))+1/(1+tf.exp(-100*(x0-0.05)))
x,y = dlib.find_min_global(test_fun,[-6],[6],10)
print("global optimal inputs: {}".format(x[0]))
print("global optimal output: {}".format(y))
t_x0=tf.Variable(tf.random_uniform((),-2,2))
cost=test_fun_tf(t_x0)
opt=tf.train.AdamOptimizer(1e-6).minimize(cost)
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for i in range(10):
loss,_,x=sess.run([cost,opt,t_x0])
print("adam optimal inputs: {}".format(x))
print("adam optimal output: {}".format(loss))

Davis King said...

The code has an option to say some variables are integers. This just removes those variables from consideration by the trust region solver, meaning their optimization is then up to just LIPO, which is most likely what you want to do.

涵胡 said...
This comment has been removed by the author.
涵胡 said...

The variabeles are all float. Actually, when I expand the trust region, adam gonna work. But still not good as global method.

----function-----
def test_fun(x0):
return 1/(1+math.exp(10*x0))+1/(1+math.exp(-10*(x0-0.05)))
def test_fun_tf(x0):
return 1/(1+tf.exp(10*x0))+1/(1+tf.exp(-10*(x0-0.05)))

----result-----
global optimal inputs: 0.02497796370782311
global optimal output: 0.875646999714699
adam optimal inputs: 0.052661534398794174
adam optimal output: 0.8779600858688354

arash allahyari said...

Hi happy new year
i wanted to build dlib with cuda and cudnn
using cmake it says
"Building a CUDA test project to see if your compiler is compatible with CUDA...
CUDA was found but your compiler failed to compile a simple CUDA program so dlib isn't going to use CUDA "

i am using mingw530_32

what is wrong with this compiler?

Davis King said...

You have to use a compiler supported by CUDA. Like visual studio 2015. You should refer to the NVIDIA documentation for complete details.

Andreas Mueller said...

This is a great write-up and excellent analysis! Thank you.

Davis King said...

Thanks :)

Andreas Mueller said...

Btw, is MaxLIPO without finding k equivalent to a GP with acquisition function? Looks very similar...

Davis King said...

Yeah, I'm sure there is a way to write it as a GP, like by selecting the right sparse kernel or something. I haven't thought about how you would do it though.

David J. Elkind said...
This comment has been removed by the author.
Davis King said...

I'm not sure I follow your question. The optimization problem you need to solve to find k is a quadratic program so you can use any quadratic program solver. Maybe you are asking about the sqrt() in the constraints? If you just square both sides of the constraints with sqrt() in them you get linear constraints.

Andreas Mueller said...

Btw, have you seen this: https://github.com/chrisstroemel/Simple looks quite similar and kinda cool.

Davis King said...

No, haven't seen that. Seems like a reasonable thing to do. There are a lot of these partitioning algorithms that don't get as much love as they probably deserve.

Chia-Li Kuo said...

Interesting and elegant work. I'll definitely try it out on a real problem.

For the Python API call find_min_global, in addition to the optimum parameters and value, is there a way to get all the parameters the algorithm has tried and their corresponding objective values?

Thanks.

Davis King said...

You can add whatever kind of logging code you want to the function find_min_global() calls. You define that function so you can put any code in it that you like.

Chia-Li Kuo said...

Makes sense. Thanks Davis.

ashwini pal said...

Nice explanation!. Are you using uniform distribution when randomly drawing the next candidate for evaluation?

Davis King said...

Thanks. Yes, that's what it does.

chholtz said...
This comment has been removed by the author.
chholtz said...

I am wondering how this method scales as the search space gets very big. I am worried that the number of constraints in the QP will get too large. What approaches do you recommend to solve this problem?

Davis King said...

It's fine. Solving the QP is not the bottleneck since constraints become inactive and stay inactive and can be pruned. The bigger issue is the overall quadratic runtime due to evaluations of U(x) having linear runtime.

Anyway, try calling it. It's plenty fast and there is a lot of room in the code to make it run even faster. But for hyper-parameter search it's way faster than it needs to be already.

Hao Zhang said...

Hi, I'm reading the paper "Global optimization of Lipschitz functions" recently and have a question about it. Hope you can shed some light on it.

In the AdaLIPO algorithm, the last line of Step 2 in Figure 4, they set kt_hat as the infimum of the set {ki}. According to this, the resulting kt_hat will only satisfy the Lipschitz inequality for a particular Xi paired with all the other Xj (where j != i). It does NOT satisfy the Lipschitz inequality for all pairs of (Xi, Xj) since one only pick the infimum (minimum). Am I reading it wrong or do they have a good reason for this that I'm missing?

Thanks for your attention.

Davis King said...

No, that's not what it says. It says they pick the smallest k that satisfies the Lipschitz condition everywhere.

Hao Zhang said...

I just realized that my confusion originated from their notation. In that formula, they have both k_i and X_i, but these are actually not the same "i"s.

Shubham Ashok Gandhi said...

Hello Davis,

Great work with the library. This could be a major relief for anyone who has to deal with parameter tuning.

I have a question for you. From reading your post, the method seems to be a sequential, direction based method. Is that correct? We are looking to automate some of the model building process and hyper-parameter tuning is one of them and so I wanted to understand if it is possible to distribute the computations of the algorithm across nodes.

So, given the nature of the algorithm, is it possible to have a distributed version of this?

Davis King said...

Yes, you can use it with parallel function evaluations. I specifically made the API in a way that supports that. Read the documentation here http://dlib.net/optimization.html#global_function_search and also click on the more details button and read the documentation. There is a lot of discussion of this topic there. The entire API is designed around the goal of supporting parallel function evaluation.

Chavdar Papazov said...

Hi Davis,

You mentioned that in your MaxLIPO version of LIPO you select the maximum upper bounding point, i.e., you are maximizing U(x). How exactly do you do this? Thank you.

Davis King said...

If you square both sides of the constraints in the QP shown in the post you get a QP in canonical form and can use any QP solver you want to solve it. That gives you U(x).

bkj said...

Super interesting. I was going to try to use this to optimize some neural network hyperparameters (LR, momentum, etc) -- but I was wondering how well you'd expect this to work when there's variation in the function evaluation. Any thoughts?

Thanks

Davis King said...

That's exactly what this optimizer is for. So it will work fine.

new home Fox Chapel 15238 said...

Hi Davis,
Thanks for the great work
I am doing a port to scala so that it can be used in spark env, would you mind helping me understand some of the issues:
https://github.com/davisking/dlib/blob/master/dlib/global_optimization/upper_bound_function.h#L217 trainer.force_last_weight_to_1(true);
any reason to set this? I am using liblinear-java, which does not have force_last_weight_to_1
https://github.com/bwaldvogel/liblinear-java/

Thanks!

Davis King said...

Yes, you need that.

Why don't you just call dlib? It's not hard to call C++ from spark.

new home Fox Chapel 15238 said...

I would love to, however, in the spark mode, each model may run on one or multiple executors, not sure how to wrap it with function evaluation

Yitong Zhou said...

Great blog Davis and very intriguing! The solver part of delegating it back to solve a similar SVM problem is so elegant!

One question though, assuming after we estimate out a good U(x) with minimized K in each step; how should we decide the next evaluation point? Looks to me that in one-dimensional case, all candidate points are within a fixed set: {x|x=(xi + xj) / 2}, so it's only O(n^2) complexity to query U(x). How about higher dimensional cases? Can we still only evaluate those boundary points? Or maybe we need to do random sampling?

Davis King said...

I just evaluate 5000 random points and take the best. I thought about how you might try to optimize U(x) exactly, but I'm not sure there is an efficient way. And random search of U(x) is plenty good enough.

Matías Mattamala said...

Hi Davis! Thank you very much for this impressive algorithm. I was recently struggling with Bayesian Optimization and its parameters, so I definitively got what you were talking about in the first paragraphs of this article.

However, I'm wondering if it possible to save the "current state" of the optimization process. BayesOpt (the C++ lib) has a very nice functionality that allows to save the current sampled points as well as the parameters of the optimization, being able to resume it later; I think that something similar could be possible in this case saving the current bounds.

I haven't explored DLib fully (I just discovered it some weeks ago), so I don't know if this is currently implemented, but it would be very useful if it is not.

Thanks!

Davis King said...

Thanks, glad you like it.

Yes, you can do this. See the extended discussion here: http://dlib.net/dlib/global_optimization/global_function_search_abstract.h.html#global_function_search. That's the C++ interface. There is a python interface with essentially an identical API as well.

Mario Rivera said...

Hi

Excellent work, thank you for sharing.
Two questions:
1) Can this be combined with hyperband https://archive.is/CDVR8 ?
2) How does this compare to HORD / pysot (https://archive.is/YgRvr)? Do you have any comment on that approach?

Greetings from London

Davis King said...

You could combine the trust region strategy with pretty much any other derivative free solver. I used LIPO here because I find LIPO the most compelling. I have not compared it to HORD.

Junwei Dong said...

Is it possible to use this brilliant optimizer in the training of neural network of machine learning? Is this optimzer is something that could replace traditional optimizer like Adam,AdaGrad...?

Davis King said...

No, it's not going to be reasonable to use this in place of adam or sgd.

Mario Rivera said...

Hi Davis

Apologies in advance if some of these is very basic.

RE:
"You could combine the trust region strategy with pretty much any other derivative free solver."
Just to be 100% clear: "the trust region strategy" is *itself* derivative-free (looking at BOBYQA definition) - is my understanding right?


RE:
Q: "Is it possible to use this brilliant optimizer in the training of neural network of machine learning? Is this optimzer is something that could replace traditional optimizer like Adam,AdaGrad...?"
A: "No, it's not going to be reasonable to use this in place of adam or sgd."

Is this true only of neural network hyperparam optimisations, or more in genearl of machine learning hyperparam optimisation problems?
Would you mind expanding on this?

Reiterating my gratitude for your excellent work.

Davis King said...

Yes, it's all derivative free.

This kind of algorithm only works for problems with a relatively small number of parameters. There is no way it could possibly optimize a problem with 10s of thousands of parameters (or more, some DNNs have millions to billions of parameters). This is true for any derivative free algorithm. It's just not going to work.

But if you have something like 5 parameters you want to optimize then it's great.