Very nice work Mr. King!

I played a bit with your learning rate scheduler. I think the 'num_steps_without_decrease' threshold should somehow be tied to the dataset size and/or the mini-batch size. Do you have any recommendations?

How many loss values do I have to save to get an accurate estimate? Is 2 * num_steps_without_decrease sufficient?

I don't think the particular setting of the threshold matters. That degree of freedom is taken up by the "how many steps threshold".

The loop needs to run the entire length. You want to use the most confident estimate you will see. So if you get to the end and it's not clear that things are decreasing then you should report that. Anyway, try it and see what happens if what I'm saying isn't clear. You will see that you want to run over the whole array.
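To make the "run over the whole array" point concrete, here is a minimal Python sketch. It is not dlib's code: the function names are hypothetical, and it assumes a plain OLS fit with i.i.d. Gaussian noise. It refits from scratch for every window, which is slow but easy to read. It scans every window of the most recent n losses and remembers the largest n for which the data is not confidently decreasing.

```python
import math

def prob_slope_negative(y):
    """P(slope < 0) for an OLS line fit to y, with x = 0, 1, ..., len(y)-1.
    Assumes len(y) >= 3 and i.i.d. Gaussian noise (an assumption of this sketch)."""
    n = len(y)
    x_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((v - (intercept + slope * i)) ** 2 for i, v in enumerate(y))
    var_slope = sse / (n - 2) / sxx  # variance of the slope estimate
    if var_slope == 0.0:
        return 1.0 if slope < 0 else 0.0
    z = -slope / math.sqrt(var_slope)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Gaussian CDF at z

def steps_without_decrease_sketch(losses, prob_threshold=0.51):
    """Largest n such that the last n losses are NOT confidently decreasing."""
    count = 0
    # Visit every window length; do not stop at the first confident window.
    for n in range(3, len(losses) + 1):
        if prob_slope_negative(losses[-n:]) < prob_threshold:
            count = n
    return count
```

A steadily falling loss should give 0, while a loss that has been flat for a long time should give a number close to the length of the flat stretch.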

On my datasets (greyscale images) I periodically train with SGD to see if I can get a better result, but validation is always a little worse than ADAM with an adaptive learning rate. Vanilla ADAM is worse though.

Have you looked into any ways of choosing the threshold? I wonder if it can also be automated.

With regard to the code, count_steps_without_decrease loops until it has processed the entire container. Should it instead return as soon as it has found P(slope < 0) >= 0.51?

Thanks again, I am using your method now and it works great.

Yes, you have the right idea. You run backwards over the data until P(slope<0) is >= 0.51. If you have to go back really far to see that the slope is decreasing then you have probably converged. 

Huh, well, it depends on the problem. Most of the time I find that ADAM makes things worse.

By the way, I find that using an adaptive learning rate schedule with ADAM makes it work even better.

Ok, I think I got it: you calculate P(slope < 0) for the last n loss values, for n in 1:N (N being the container size), and choose the maximum value of n (n_max) where P(slope < 0) < 0.51. That is, for all n in n_max+1:N, P(slope < 0) >= 0.51. (Or should it be the minimum n where P(slope < 0) >= 0.51?)

There aren't N slope values, there is one slope value. You find it using OLS. You can do a recursive OLS if you want. It doesn't matter how you find it so long as you find it.
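As one illustration of "it doesn't matter how you find it", the slope can be kept up to date from running sums, so adding a new point costs O(1). This is only a sketch with hypothetical names: a simple running-sums update rather than a full recursive least squares filter, and not how dlib implements it.

```python
class RunningSlope:
    """Incrementally tracks the OLS slope of y against x = 0, 1, 2, ..."""

    def __init__(self):
        self.n = 0
        self.sx = 0.0   # sum of x
        self.sy = 0.0   # sum of y
        self.sxx = 0.0  # sum of x*x
        self.sxy = 0.0  # sum of x*y

    def add(self, y):
        x = float(self.n)  # the next x is simply the sample index
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def slope(self):
        if self.n < 2:
            raise ValueError("need at least two points to fit a line")
        denom = self.n * self.sxx - self.sx ** 2
        return (self.n * self.sxy - self.sx * self.sy) / denom
```

Feeding it 1, 3, 5, 7 gives the same single slope you would get from a batch OLS fit over those four points.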

Hi, thanks for the post. I'm in the process of implementing this. My current method is: if the average loss of the last n batches has not decreased for the last m times, halve the learning rate.
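For comparison, the heuristic described in this comment could look roughly like the sketch below. The class name and the exact reading of "has not decreased for the last m times" are assumptions: here each block of n losses is averaged, and the rate is halved after m consecutive blocks that fail to beat the best average seen so far.

```python
class HalveOnPlateau:
    """Hypothetical sketch of the commenter's rule, not the blog post's method."""

    def __init__(self, lr, n, m):
        self.lr = lr
        self.n = n                # batches per averaging window
        self.m = m                # stale windows tolerated before halving
        self.window = []
        self.best = float("inf")  # best window average seen so far
        self.stale = 0            # consecutive windows without improvement

    def step(self, loss):
        """Record one mini-batch loss and return the current learning rate."""
        self.window.append(loss)
        if len(self.window) < self.n:
            return self.lr
        avg = sum(self.window) / self.n
        self.window = []
        if avg < self.best:
            self.best = avg
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.m:
                self.lr *= 0.5
                self.stale = 0
        return self.lr
```

Note that this compares raw averages, so on a noisy loss it can trigger early or late depending on how n and m are chosen.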

Just want to confirm one detail: it looks like you calculate the n slope values of Y using the values Y[n-i:n], i.e. 2, 3, 4, 5, ..., n values are used for the calculation each time. Why do it like this, instead of, for example, keeping a running score of P always calculated using the last n values of Y?

There is dlib.probability_that_sequence_is_increasing()

Very interesting and seamless automation indeed.
I notice that dlib's Python module does contain the count_steps_without_decrease function. However, regarding your last comment about resetting the training at an earlier stage, I don't see the P(slope>X) function being exposed.
It is in the C++ source, and I wonder if it could be exposed on the Python side.
Thx

Yes, you could do that and it wouldn't be awful. However, there are cases where it would put you into an infinite loop. Consider the case where the slope of the loss curve is asymptotically approaching zero but never goes positive. Simply thresholding m at 0 will never terminate but the test I suggested would. Most real world problems probably don't exhibit that kind of behavior very often, but these kinds of corner cases are things you need to be concerned with if you want to make a robust numerical solver.

Slope(Y) is a Gaussian variable. So you just need to know its mean and variance and then you call a function that computes the CDF of a Gaussian. Every platform has such functions already available. You don't need to implement it yourself.
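A minimal sketch of that calculation in plain Python, for anyone who wants to see the steps spelled out. It assumes the standard OLS setup (x = 0, 1, 2, ... and i.i.d. Gaussian noise), uses `math.erf` for the Gaussian CDF, and the function name is hypothetical. dlib's own implementation may differ in the details.

```python
import math

def prob_slope_negative(y):
    """Estimate P(Slope(Y) < 0) from an OLS line fit to the sequence y."""
    n = len(y)
    if n < 3:
        return 0.5  # too few points to estimate the noise; no evidence either way

    x_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))

    slope = sxy / sxx                  # mean of the Gaussian
    intercept = y_mean - slope * x_mean
    sse = sum((v - (intercept + slope * i)) ** 2 for i, v in enumerate(y))
    var_slope = sse / (n - 2) / sxx    # variance of the Gaussian

    if var_slope == 0.0:               # perfectly linear data, no uncertainty
        return 1.0 if slope < 0 else 0.0

    # Gaussian CDF evaluated at 0: Phi((0 - slope) / std)
    z = (0.0 - slope) / math.sqrt(var_slope)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

A clearly falling sequence gives a value near 1, a clearly rising one gives a value near 0, and pure noise around a constant gives something close to 0.5.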

Pardon me if this question seems elementary, but I can't understand why we have to calculate P(Slope(Y) < 0).

Can't we just do OLS to get an approximate value for `m`, and then take action depending on whether it's negative or positive?

What I wonder is two things:

1. What benefit does calculating the probability have when we can have an approximate `m`?

2. How exactly is the probability computed? I can understand the OLS but I couldn't get how you went from `Slope(Y)` to P(Slope(Y) < 0)... how did you calculate the probability?

Thanks for this great Blog post...

Some models take days to train while others take minutes. You don't want to just always run for days, that's a waste of time. Conversely, if you don't let the solver run long enough you will underfit. How are you going to know, ahead of time, how long to run the solver? You don't. You need to measure progress while the solver is running and do something reasonable. Just running blind to the problem you are optimizing is never going to be a good idea.

Why is it better than a predefined learning rate policy, like step or polynomial decay?