Comments on dlib C++ Library: Automatic Learning Rate Scheduling That Really Works

This is a really helpful way to think about detect...

2022-02-04T10:55:47.518-05:00

This is a really helpful way to think about detecting whether or not a model is still improving, and much more refined than the typical approaches, which are more or less ad hoc. One thing I wonder about, though, is the reasoning behind filtering out the largest 10% of values. Your observation that NNs sometimes exhibit a transient spike in loss is correct. But always removing the largest 10% of values seems a bit ad hoc itself; indeed, it's possible that there are no transient spikes present, or if there are, they might comprise more or less than 10% of the data. One option I might consider instead is using Cook's Distance, which is a way to detect "highly influential observations," in the sense of measuring how much a regression changes when a single observation is omitted.
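To make the Cook's Distance suggestion concrete, here is a minimal sketch, assuming the loss values are regressed on their step index 0, 1, 2, ... with an ordinary least squares line. The function name `cooks_distance`, the synthetic loss curve, and the 4/n cutoff (a commonly quoted rule of thumb) are illustrative assumptions, not part of dlib.

```python
import numpy as np

def cooks_distance(y):
    """Cook's distance of each point in a straight-line fit of y against 0, 1, 2, ..."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.column_stack([np.arange(n), np.ones(n)])  # slope and intercept columns
    p = X.shape[1]                                   # number of fitted parameters (2)
    XtX_inv = np.linalg.inv(X.T @ X)
    # Leverage h_ii = x_i^T (X^T X)^{-1} x_i, the diagonal of the hat matrix.
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    beta = XtX_inv @ X.T @ y
    residuals = y - X @ beta
    s2 = residuals @ residuals / (n - p)             # residual variance estimate
    return residuals**2 / (p * s2) * leverage / (1.0 - leverage) ** 2

# Hypothetical loss curve: slow decrease, small noise, one transient spike at step 120.
rng = np.random.default_rng(0)
losses = 1.0 - 0.001 * np.arange(200) + 0.01 * rng.standard_normal(200)
losses[120] += 0.5

d = cooks_distance(losses)
flagged = np.flatnonzero(d > 4.0 / len(losses))      # assumed rule-of-thumb cutoff
print(flagged)
```

Unlike a fixed 10% cut, this flags only the points that actually move the fitted line, so a spike-free series would have little or nothing removed.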

See this section https://en.wikipedia.org/wiki/Ord...

2019-11-08T19:37:34.800-05:00

See this section https://en.wikipedia.org/wiki/Ordinary_least_squares#Finite_sample_properties for notes on the variance computation. I'm not sure where a reference is for the sigma computation. I think I worked it out with pencil and paper.

There is however some parts which I don't unde...

2019-11-08T05:01:05.534-05:00

There are, however, some parts which I don't understand and cannot figure out from the literature:
- why is the variance of the slope not simply R(0, 0)?
- where does the formula for computing sigma^2 come from? In your code, you do something like this: residual_squared = residual_squared + std::pow((y - trans(x)*w),2.0)*temp;
Any pointers on relevant literature would be helpful, thanks! What I looked at is recursive least squares and Kalman filter.

True, I was exactly wondering the same:) since you...

2019-11-08T02:02:41.422-05:00

True, I was wondering exactly the same :) since you always process it in batch sequentially anyway. Anyway, thanks! Good job!

Eh, the variance of the slope is also sigma**2 * R...

2019-11-07T19:55:48.143-05:00

Eh, the variance of the slope is also sigma**2 * R(0,0). So yeah, could have done that too. The deeper question is why in the world did I use recursive least squares for this rather than just inv(XX)*XY. This code is more complex than it needs to be in that way. There is some story that makes sense of how the code evolved to this, I forget what it is though.
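The equivalence between the two expressions can be checked numerically. This is a minimal sketch, assuming R stands for inv(X^T X) with the design matrix built from x = 0, 1, ..., n-1 and a column of ones, and assuming the slope is the first column (so its entry is R[0, 0]); the function name is hypothetical.

```python
import numpy as np

def slope_variance_factors(n):
    """Return (R[0, 0], 12 / (n**3 - n)) for a line fit on x = 0, 1, ..., n-1."""
    x = np.arange(n, dtype=float)
    X = np.column_stack([x, np.ones(n)])  # assumed column order: slope, then intercept
    R = np.linalg.inv(X.T @ X)            # the plain inv(XX) form, no recursion
    return R[0, 0], 12.0 / (n**3 - n)

for n in (5, 50):
    from_matrix, closed_form = slope_variance_factors(n)
    print(n, from_matrix, closed_form)
```

Multiplying either factor by sigma**2 gives the variance of the slope, so the closed form and the matrix entry are interchangeable here.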

Isn't the variance of the slope directly reada...

2019-11-07T04:24:01.033-05:00

Isn't the variance of the slope directly readable in the covariance matrix (R in your code) instead of computing 12 *sigma**2 / (n ** 3 - n)?

I don't see how the weight decay has anything ...

2019-03-25T07:46:46.147-04:00

I don't see how the weight decay has anything to do with the calculations discussed in this blog post. Here, we are talking about detecting when a noisy time series has become asymptotically flat. You could do this with any time series, it doesn't even have to come from a deep learning training procedure. The time series could come from a noisy sensor measuring the water level in a tank and the math would be exactly the same.

And what about the weight decay, as the weight dec...

2019-03-25T02:44:57.669-04:00

And what about weight decay? Weight decay affects the optimization process, but it seems that you did not include it in the calculation of steps_without_decrease. So if the weight decay is fairly large, might the calculation give a wrong result?

If your stopping threshold is "stop when it'...

2019-03-22T07:32:43.967-04:00

If your stopping threshold is "stop when it's been flat for at least 1000 steps" then you only need to keep the last 1000 loss values in Y. Any more history is irrelevant, by definition, since you only care about the last 1000.
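One simple way to keep only the most recent loss values is a fixed-size buffer. This is a minimal sketch using Python's `collections.deque`; the class name `LossHistory` and its methods are hypothetical, not part of dlib.

```python
from collections import deque

class LossHistory:
    """Keeps only the most recent `threshold` loss values."""

    def __init__(self, threshold):
        # A deque with maxlen silently drops the oldest value once it is full.
        self.values = deque(maxlen=threshold)

    def add(self, loss):
        self.values.append(loss)

    def is_full(self):
        return len(self.values) == self.values.maxlen

history = LossHistory(threshold=1000)
for step in range(2500):
    history.add(float(step))  # stand-in for a real loss value

print(len(history.values), history.values[0], history.values[-1])
```

Memory and per-step cost then stay bounded by the threshold, no matter how long training runs.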

I have a question about the selection of Y in your...

2019-03-22T03:57:05.539-04:00

I have a question about the selection of Y in your procedure: how many iterations of Y do you keep? For instance, if we set steps_without_decrease to 1000, how much of Y should we keep? Do we use the entire history of Y since training started, or just Y[-2000:] or Y[-10000:]? I ask because I found that calculating 'steps_without_decrease' with a very large Y is a bit time consuming.

Because there are 2 degrees of freedom when fittin...

2019-01-30T22:16:11.561-05:00

Because there are 2 degrees of freedom when fitting a line, the slope and the intercept. But the best way to see this is to look at the equation for standard error (see https://en.wikipedia.org/wiki/Ordinary_least_squares), apply it to our specific case, and find that you end up with this equation. When in doubt, do the algebra :)
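For reference, the standard OLS estimate of the noise variance for a fitted line can be written as below. The symbols \hat{a} (slope) and \hat{b} (intercept) are notation assumed here for illustration.

```latex
\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=0}^{n-1} \left( y_i - \hat{a}\, x_i - \hat{b} \right)^2
```

The residuals satisfy two linear constraints, one for each fitted parameter, so only n-2 of them are free to vary; dividing by n-2 rather than n-1 is what makes the estimate unbiased.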

Really simple question, but I can't find the a...

2019-01-30T05:13:39.386-05:00

Really simple question, but I can't find the answer in 15 minutes of Googling: why is the denominator in the variance calculation n-2 and not n-1?

Ah, yes, I didn't realize the substitution had...

2019-01-17T10:58:15.812-05:00

Ah, yes, I didn't realize the substitution had already been made. I see now that this result is easily derived from that assumption.

Yes, I'm quite confident it is correct. If yo...

2019-01-17T07:32:30.592-05:00

Yes, I'm quite confident it is correct. If you take the textbook formula and plug in the specific case here (in particular, x_i is defined on the integers 0,1,2,3,...) you end up with the formula in this post. Just plug in some values and you will see you get the same results.
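The substitution can be written out as a short derivation, assuming x_i = 0, 1, ..., n-1 and the textbook slope variance sigma^2 / sum (x_i - \bar{x})^2:

```latex
\bar{x} = \frac{n-1}{2}, \qquad
\sum_{i=0}^{n-1} x_i^2 = \frac{(n-1)\,n\,(2n-1)}{6}

\sum_{i=0}^{n-1} (x_i - \bar{x})^2
  = \sum_{i=0}^{n-1} x_i^2 - n\,\bar{x}^2
  = \frac{(n-1)\,n\,(2n-1)}{6} - \frac{n\,(n-1)^2}{4}
  = \frac{n\,(n-1)\,(n+1)}{12}
  = \frac{n^3 - n}{12}

\operatorname{Var}(\hat{a})
  = \frac{\sigma^2}{\sum_{i=0}^{n-1} (x_i - \bar{x})^2}
  = \frac{12\,\sigma^2}{n^3 - n}
```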

Are you confident that you have the correct sampli...

2019-01-15T14:01:48.678-05:00

Are you confident that you have the correct sampling distribution for the slope? 12 *sigma**2 / (n ** 3 - n) is not an expression for the variance of the sampling distribution of the slope that I am familiar with.

For example, these slides outline the derivation of the sampling distribution of the OLS slope when we assume a normal distribution for the errors. The variance of the sampling distribution of the slope is given as sigma**2 / sum((x_i - bar(x))**2).
http://www.robots.ox.ac.uk/~fwood/teaching/W4315_Fall2011/Lectures/lecture_4/lecture4.pdf

Is there something else going on in this procedure which yields the variance estimator 12 *sigma**2 / (n ** 3 - n) ?

Yes, if that's all you do with it then you jus...

2018-04-09T15:40:56.668-04:00

Yes, if that's all you do with it then you just need to compute it for the longest length. But I find that logging the point at which the loss stopped being flat is a very useful diagnostic.

Yes, it doesn't always output N - I've bee...

2018-04-09T01:33:55.527-04:00

Yes, it doesn't always output N - I've been using the function for the last week and plotting the count to help choose a good threshold. What I'm saying is that if you use N = threshold, all you need to do is calculate P one time for n=N, and if P < 0.51 drop the learning rate.

That's not what happens. It doesn't just ...

2018-04-08T20:50:42.128-04:00

That's not what happens. It doesn't just always output N. You should run the function and see what it does. I think that will help you understand it.

P.S. if it did exit the loop as soon as P >= 0....

2018-04-08T06:45:20.212-04:00

P.S. if it did exit the loop as soon as P >= 0.51 this could also have problems, e.g. imagine if the last 3 loss values just happened to line up nicely and give P>=0.51...

At present, the count_steps_without_decrease funct...

2018-04-08T06:42:52.108-04:00

At present, the count_steps_without_decrease function returns the *maximum* count (n) for which there is *no* evidence of decreasing, regardless of whether or not smaller values of n had evidence. I.e. if for n = 1:100 you have evidence, but for n = 101 you do not, it would return 101. Sorry to bring this up again, but I'm not sure if it is by design or not (or am I reading the code wrongly!), because that is slightly different from what you write in the blog post. It would be different if count_steps_without_decrease instead returned as soon as it found an n with P >= 0.51, i.e. the minimum n for which there is evidence.

Anyway, it means that if you save the last N loss values and *also* set the threshold to N, there is no point calculating P for the last n < N loss values... Just calculate P once for n = N and if it is less than 0.51 (no evidence) the threshold has been met, because it doesn't matter what P is for n < N. My current code keeps the last 2 * threshold loss values.
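The behavior under discussion can be sketched as follows. This is a minimal illustration, not dlib's actual implementation: it assumes a normal approximation for the slope's sampling distribution with variance 12*sigma**2/(n**3 - n), and the helper `prob_slope_negative` and the synthetic series are hypothetical.

```python
import math
import numpy as np

def prob_slope_negative(y):
    """P(slope < 0) for a line fit to y against 0, 1, ..., n-1 (normal approximation)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    x = np.arange(n)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    sigma2 = residuals @ residuals / (n - 2)   # two fitted parameters
    slope_var = 12.0 * sigma2 / (n**3 - n)
    if slope_var == 0.0:                       # perfect fit: the sign of the slope decides
        return 1.0 if slope < 0 else 0.0
    z = -slope / math.sqrt(slope_var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def count_steps_without_decrease(losses, threshold=0.51):
    """Length of the longest trailing window with no evidence of a decrease."""
    n = len(losses)
    steps = 0
    # Walk backwards over every trailing window of length 3, 4, ..., n.
    for i in reversed(range(n - 2)):
        if prob_slope_negative(losses[i:]) < threshold:
            steps = n - i                      # a longer window overwrites a shorter one
    return steps

decreasing = [1.0 - 0.01 * i + 0.001 * math.sin(i) for i in range(100)]
increasing = [0.01 * i + 0.001 * math.sin(i) for i in range(100)]
print(count_steps_without_decrease(decreasing))
print(count_steps_without_decrease(increasing))
```

Because `steps` is overwritten each time a longer window shows no evidence, the loop reports the maximum such window rather than stopping at the first one, which is the distinction the comment above is drawing.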

I don't think it has much to do with dataset s...

2018-04-07T07:19:29.194-04:00

I don't think it has much to do with dataset size. If it's related to anything in the dataset it's the underlying difficulty of the problem, which is more a function of the signal to noise ratio than anything related to the size. I would simply set it to a large value, the largest value that is tolerable and then not worry about it.

You only need num_steps_without_decrease (the threshold) loss values since you don't look at any values beyond those.

Very nice work Mr. King! I played a bit with your...

2018-04-06T11:21:04.786-04:00

Very nice work Mr. King!

I played a bit with your learning rate scheduler. I think that the 'num_steps_without_decrease' threshold should somehow be tied to the dataset size and/or the mini-batch size. Do you have any recommendations?

How many loss values do I have to save to get an accurate estimate? Is 2 * num_steps_without_decrease sufficient?

I don't think the particular setting of the th...

2018-04-05T06:59:30.487-04:00

I don't think the particular setting of the threshold matters. That degree of freedom is taken up by the "how many steps threshold".

The loop needs to run the entire length. You want to use the most confident estimate you will see. So if you get to the end and it's not clear that things are decreasing then you should report that. Anyway, try it and see what happens if what I'm saying isn't clear. You will see that you want to run over the whole array.

On my datasets (greyscale images) I periodically t...

2018-04-05T03:20:41.957-04:00

On my datasets (greyscale images) I periodically train SGD to see if I can get a better result, but validation is always a little worse than ADAM with adaptive training rate. Vanilla ADAM is worse though.

Have you looked into any ways of choosing the threshold? I wonder if it can also be automated.

With regards to the code, count_steps_without_decrease loops until it processes the entire container, should it instead return as soon as it has found P(slope < 0) >= 0.51?

Thanks again, I am using your method now and it works great.

Yes, you have the right idea. You run backwards ov...

2018-04-04T17:25:53.500-04:00

Yes, you have the right idea. You run backwards over the data until P(slope<0) is >= 0.51. If you have to go back really far to see that the slope is decreasing then you have probably converged.

Huh, well, it depends on the problem. Most of the time I find that ADAM makes things worse.