Comments on dlib C++ Library: Automatic Learning Rate Scheduling That Really Works
Davis King, feed updated 2019-02-16T18:38:45.386-05:00

[2019-01-30T22:16:11.561-05:00]
Because there are 2 degrees of freedom when fitting a line, the slope and the intercept. But the best way to see this is to look at the equation for standard error (see https://en.wikipedia.org/wiki/Ordinary_least_squares), apply it to our specific case, and find that you end up with this equation. When in doubt, do the algebra :)
-- Davis King

[2019-01-30T05:13:39.386-05:00]
Really simple question, but I can't find the answer in 15 minutes of Googling: why is the denominator in the variance calculation n-2 and not n-1?
-- Matt

[2019-01-17T10:58:15.812-05:00]
Ah, yes, I didn't realize the substitution had already been made. I see now that this result is easily derived from that assumption.
-- David J. Elkind

[2019-01-17T07:32:30.592-05:00]
Yes, I'm quite confident it is correct. If you take the textbook formula and plug in the specific case here (in particular, x_i is defined on the integers 0,1,2,3,...)
you end up with the formula in this post. Just plug in some values and you will see you get the same results.
-- Davis King

[2019-01-15T14:01:48.678-05:00]
Are you confident that you have the correct sampling distribution for the slope? 12 * sigma**2 / (n**3 - n) is not an expression for the variance of the sampling distribution of the slope that I am familiar with.

For example, these slides outline the derivation of the sampling distribution of the OLS slope when we assume a normal distribution for the errors. The derivation for the variance of the sampling distribution of the slope is given as sigma**2 / sum((x_i - bar(x))**2).
http://www.robots.ox.ac.uk/~fwood/teaching/W4315_Fall2011/Lectures/lecture_4/lecture4.pdf

Is there something else going on in this procedure which yields the variance estimator 12 * sigma**2 / (n**3 - n)?
-- David J. Elkind

[2018-04-09T15:40:56.668-04:00]
Yes, if that's all you do with it then you just need to compute it for the longest length. But I find that logging the point at which the loss stopped being flat is a very useful diagnostic.
-- Davis King

[2018-04-09T01:33:55.527-04:00]
Yes, it doesn't always output N - I've been using the function for the last week and plotting the count to help choose a good threshold.
What I'm saying is that if you use N = threshold, all you need to do is calculate P one time for n = N, and if P < 0.51 drop the learning rate.
-- chaos

[2018-04-08T20:50:42.128-04:00]
That's not what happens. It doesn't just always output N. You should run the function and see what it does. I think that will help you understand it.
-- Davis King

[2018-04-08T06:45:20.212-04:00]
P.S. if it did exit the loop as soon as P >= 0.51 this could also have problems, e.g. imagine if the last 3 loss values just happened to line up nicely and give P >= 0.51...
-- chaos

[2018-04-08T06:42:52.108-04:00]
At present, the count_steps_without_decrease function returns the *maximum* count (n) for which there is *no* evidence of decreasing, regardless of whether or not smaller values of n had evidence. I.e. if for n = 1:100 you have evidence, but for n = 101 you do not, it would return 101. Sorry to bring this up again but I'm not sure if it is by design or not (or am I reading the code wrongly?), because that is slightly different to what you write in the blog post. It would be different if instead count_steps_without_decrease returned as soon as it finds n for P >= 0.51, i.e.
the minimum n for which there is evidence.

Anyway it means that if you save the last N loss values and *also* set the threshold to N, there is no point calculating P for the last n < N loss values... Just calculate P once for n = N and if it is less than 0.51 (no evidence) the threshold has been met, because it doesn't matter what P is for n < N. My current code keeps the last 2 * threshold loss values.
-- chaos

[2018-04-07T07:19:29.194-04:00]
I don't think it has much to do with dataset size. If it's related to anything in the dataset it's the underlying difficulty of the problem, which is more a function of the signal to noise ratio than anything related to the size. I would simply set it to a large value, the largest value that is tolerable, and then not worry about it.

You only need num_steps_without_decrease (the threshold) loss values since you don't look at any values beyond those.
-- Davis King

[2018-04-06T11:21:04.786-04:00]
Very nice work Mr. King!
I played a bit with your learning rate scheduler. I think that setting the 'num_steps_without_decrease' threshold should be somehow bound to the dataset size and/or the mini-batch size. Do you have any recommendations?

How many loss values do I have to save to get an accurate estimate? Is 2 * num_steps_without_decrease sufficient?
-- nuke16000

[2018-04-05T06:59:30.487-04:00]
I don't think the particular setting of the threshold matters. That degree of freedom is taken up by the "how many steps threshold".

The loop needs to run the entire length. You want to use the most confident estimate you will see. So if you get to the end and it's not clear that things are decreasing then you should report that. Anyway, try it and see what happens if what I'm saying isn't clear. You will see that you want to run over the whole array.
-- Davis King

[2018-04-05T03:20:41.957-04:00]
On my datasets (greyscale images) I periodically train with SGD to see if I can get a better result, but validation is always a little worse than ADAM with adaptive training rate. Vanilla ADAM is worse though.

Have you looked into any ways of choosing the threshold?
I wonder if it can also be automated.

With regards to the code, count_steps_without_decrease loops until it processes the entire container. Should it instead return as soon as it has found P(slope < 0) >= 0.51?

Thanks again, I am using your method now and it works great.
-- chaos

[2018-04-04T17:25:53.500-04:00]
Yes, you have the right idea. You run backwards over the data until P(slope < 0) is >= 0.51. If you have to go back really far to see that the slope is decreasing then you have probably converged.

Huh, well, it depends on the problem. Most of the time I find that ADAM makes things worse.
-- Davis King

[2018-04-04T09:09:47.209-04:00]
By the way, I find that using an adaptive learning rate schedule with ADAM makes it work even better.
-- chaos

[2018-04-04T09:07:19.038-04:00]
Ok I think I got it: You calculate P(slope < 0) for the last n loss values, for n in 1:N (N being the container size) and choose the maximum value of n (n_max) where P(slope < 0) < 0.51. That is, for all n in n_max+1:N, P(slope < 0) >= 0.51.
(Or should it be the minimum n where P(slope < 0) >= 0.51?)
-- chaos

[2018-04-03T15:42:06.403-04:00]
There aren't N slope values, there is one slope value. You find it using OLS. You can do a recursive OLS if you want. It doesn't matter how you find it so long as you find it.
-- Davis King

[2018-04-03T07:11:27.288-04:00]
Hi, thanks for the post. I'm in the process of implementing this. My current method is: if the average loss of the last n batches has not decreased for the last m times, halve the learning rate.

Just want to confirm one detail - it looks like you calculate the n slope values of Y using the values Y[n-i:n], e.g. there will be 2,3,4,5 -> n values used for the calculation each time. Why do it like this, instead of for example keeping a running score of P always calculated using the last n values of Y?
-- chaos

[2018-03-02T19:41:36.754-05:00]
There is dlib.probability_that_sequence_is_increasing()
-- Davis King

[2018-03-02T14:36:44.649-05:00]
Very interesting and seamless automation indeed.
I notice that the Python module dlib does contain the count_steps_without_decrease function. However, regarding your last comment about resetting the training at an earlier stage, I don't see the P(slope>X) function being exposed.
It is in the C++ source and I wonder if it could be exposed to the Python side.
Thx
-- Unknown

[2018-03-01T07:05:24.890-05:00]
Yes, you could do that and it wouldn't be awful. However, there are cases where it would put you into an infinite loop. Consider the case where the slope of the loss curve is asymptotically approaching zero but never goes positive. Simply thresholding m at 0 will never terminate but the test I suggested would. Most real world problems probably don't exhibit that kind of behavior very often, but these kinds of corner cases are things you need to be concerned with if you want to make a robust numerical solver.

Slope(Y) is a Gaussian variable. So you just need to know its mean and variance and then you call a function that computes the CDF of a Gaussian. Every platform has such functions already available. You don't need to implement it yourself.
-- Davis King

[2018-03-01T05:30:54.176-05:00]
Pardon me if this question seems elementary, but I can't understand why we have to calculate P(Slope(Y) < 0)...

Can't we just do OLS and get an approximate value for `m`, then take action depending on whether it's negative or positive?

What I wonder is two things:

1.
What benefit does calculating the probability have when we can have an approximate `m`?

2. How exactly is the probability computed? I can understand the OLS but I couldn't get how you went from `Slope(Y)` to P(Slope(Y) < 0)... how did you calculate the probability?

Thanks for this great blog post...
-- Mohammad Mahdi

[2018-02-13T12:51:53.167-05:00]
Some models take days to train while others take minutes. You don't want to just always run for days, that's a waste of time. Conversely, if you don't let the solver run long enough you will underfit. How are you going to know, ahead of time, how long to run the solver? You don't. You need to measure progress while the solver is running and do something reasonable. Just running blind to the problem you are optimizing is never going to be a good idea.
-- Davis King

[2018-02-13T11:53:28.992-05:00]
Why is it better than a predefined learning rate policy, like steps or polynomial decay?
-- Boris Ginsburg
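
For anyone who wants to check the numbers discussed in this thread, here is a minimal Python sketch. It assumes the loss values sit at x = 0, 1, ..., n-1 and that the slope estimate is treated as Gaussian, as described in the comments above. It is not dlib's implementation: `slope_and_variance` and `prob_slope_negative` are hypothetical helper names, and this `count_steps_without_decrease` only mimics the behaviour described in the comments (scan the whole history and report the longest tail with no evidence of a decrease).

```python
from statistics import NormalDist


def slope_and_variance(y):
    """OLS slope of y against x = 0..n-1, plus the variance of that slope estimate."""
    n = len(y)
    if n < 3:
        raise ValueError("need at least 3 points (the variance divides by n - 2)")
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    # For integer x = 0..n-1, sum((x_i - x_mean)**2) equals (n**3 - n) / 12.
    sxx = (n**3 - n) / 12
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual variance: n - 2 because two parameters (slope, intercept) were fitted.
    rss = sum((v - (intercept + slope * i)) ** 2 for i, v in enumerate(y))
    sigma2 = rss / (n - 2)
    # Textbook sigma**2 / sxx, i.e. 12 * sigma**2 / (n**3 - n) for this choice of x.
    return slope, sigma2 / sxx


def prob_slope_negative(y):
    """P(slope < 0), treating the slope as Gaussian with the mean and variance above."""
    slope, var = slope_and_variance(y)
    if var == 0:
        # Perfect fit: no uncertainty, so the sign of the slope decides (assumption).
        return 1.0 if slope < 0 else 0.0
    return NormalDist(mu=slope, sigma=var ** 0.5).cdf(0.0)


def count_steps_without_decrease(losses, threshold=0.51):
    """Largest n such that the last n losses give no evidence of a decrease."""
    count = 0
    # Deliberately checks every tail length rather than stopping at the first hit.
    for n in range(3, len(losses) + 1):
        if prob_slope_negative(losses[-n:]) < threshold:
            count = n
    return count
```

This version refits the line for every tail, so it costs O(N^2) per call. A running (recursive) OLS, as mentioned in the comments, would avoid that. In use, one would compare the returned count against a chosen number of steps and lower the learning rate once the count exceeds it.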