this lecture is an optimization for machine learning specifically we’re gonna be looking more at methods that work well for neural networks so the talk will be divided into five sections first I’m gonna give a quick intro and motivation for optimization methods in machine learning then I’m gonna talk about basic methods like gradient descent momentum methods second order methods next and then finally we’ll talk about stochastic optimization which is applicable to all the previous methods that we’re gonna discuss so optimization algorithms are the basic engine behind deep learning methods and machine learning methods in general and they enable models to learn from data by adapting their parameters so what they do is they solve a problem of minimization of an objective function that measures the mistakes that the model makes now this can be for example in classification problems this is prediction error where you’re comparing your predictions of say the label of an image to the actual values in reinforcement learning this could be negative reward where reward essentially measures how well you’re doing at the particular task and optimization methods work by making a sequence of small incremental changes to the model parameters and each step is more or less guaranteed to produce the objective by some small amount you accumulate enough of these steps in sequence and you eventually hope to solve the problem so just a little bit of notation upfront throughout the lecture I’m going to refer to the parameters always as theta denoted here these will be assumed to lie in RN this is the N dimensional real space in neural networks n can be you know sometimes hundreds of millions the objective function will be denoted by H H of so it’s a function of the parameters and you can see over here I’ve drawn a example objective function so up is the value of the objective function along the side is the value of the parameter of course if I’m drawing it in one in this two-dimensional plane I can only have a one dimensional parameter of course and in general this picture is much more interesting looking in higher dimensions but it’s useful for illustrative purposes just to consider the low dimensional cases here’s our optimum here’s maybe where we start and yes so the goal of optimization is to minimize functions of this form essentially by moving this magenta point down towards the optimum is as best we can so you know the most important example of this which arises in machine learning our objectives of this form so you have a essentially what is a sum over examples I is gonna index the example will have M of them so this is an average over those examples and it’s we’re measuring the loss between a target Y and the prediction made by the network so F here is going to denote then the neural network function which takes input X as well as the parameters output some prediction Y being the true value for the prediction that we want L is now a loss function which measures that is this level of disagreement between our prediction and the correct target so this is just illustrated here again this F is a neural network of course it doesn’t have to be and for the purposes of this lecture EF could be any sort of prediction model so next we’ll talk about the basic method of an optimization where you always have to start gradient descent so gradient descent method is as simple as it gets we so you’ll see throughout this lecture a you know a bunch of equations of this form this is showing an update rule for our parameter so at every step we advance to the next best guess for the optimum value of the parameters and we do this by taking the current value and then subtracting off in this case a multiple of the gradient to obtain the new value that multiple here is we denote by alpha that’s going to also be sometimes called the learning

rate or the step size and the gradient denoted here is just the standard gradient when you take the element-wise derivatives for each element of the parameter vector so what’s gradient descent doing why why should this method make any sense well one way to think of the gradient is that it’s giving you the direction of steepest descent in in the law surface so you know in this picture here again one dimensions well there really is only one direction you can vary you can go sort of left or right this is the direction that goes downhill so this is where the the negative gradient would be pointing in that direction we follow the negative gradient so yeah so so this is giving us the direction that that reduces the the objective value the fastest if you go in that direction now it’s not obvious that that that doing this should make any sense and indeed if the picture looks more like this and I were to start here and take a step that way I might not actually go down I mean I might start to go down but then I might start going up because the objective function curves upwards like this it curves up very very fast so this is bad of course I could remedy this by setting the learning rate lower that is the multiple that I take for the gradient but you know that might slow me down now this scheme is only gonna work if functions are ultimately smooth at some level you know when you zoom in far enough or at least continuous so another way of thinking of gradient descent is that and this is the way I prefer to think about it because it sort of leads into the other optimization methods quite elegantly is that we start so at any given point let’s just say this is where we are currently we’re gonna propose an approximation to the objective function for gradient descent this is a linear approximation so denoted by this line here the orange dotted line and here is this written in an equation this is just the first order Taylor series at the current theta and D is the direction that we’re gonna vary and we’re sort of modeling how H the objective function changes as we vary D according to a linear model again so this is a line now for small enough D this line you know matches the objective function pretty well I mean right around here it’s almost a perfect match and as you start to move away as your D gets bigger the match starts to degrade a bit but so if we were to minimize this model over a reasonable region around here that would give us a D which is this negative gradient step times some alpha and the Alpha would be determined by the region in which we’re allowed to move essentially so gradient descent is sort of the canonical algorithm but it has obvious problems problems that are sort of hard to illustrate in one dimension but if you even just go to two dimensions it already becomes apparent what’s going on so here and it’s important to understand this picture because I’m gonna keep using it throughout the lecture this what I’ve drawn here essentially is a sort of a two dimensional narrow Valley so you can think of this as a three dimensional object where up means the you know the value of the objective function so how high are you say in a two-dimensional terrain you can think of this as maybe like a hill or something and across you know so north-south-east-west is going to be the values of the parameter which is assumed here to be two-dimensional and so I’ve drawn this valley essentially where along the base of the valley there’s this sort of direction that we can move towards the minimum of the objective function but we sort of start out here at the side of the valley and we’re gonna if we’re gonna be running gradient descent with say a high learning rate what’s gonna happen is the gradient is gonna point this way right off the hill because that’s the direction that goes down the fastest and that’s gonna sort of balance the other side of that of the valley and we hit that wall and now we have the gradient pointing in once again in the gradient direction which now that’s going to be the opposite because of course the function goes down this way

so we’ll be balancing back here and if the learning rate is too high these steps are gonna oscillate wildly and eventually you’re going to diverge so this is the bad situation well East one bad situation for gradient descent now you could try to remedy this by lowering the learning rate so we’re gonna take these steps but they’re gonna be much smaller and we’re gonna bounce back and forth as we were before but it’s not gonna grow up get out of control the only problem now is we’re gonna be taking these very very small steps and once you get to the base of this valley structure and you still want to move along the base of it relatively fast well because your learning rate is small you’re not gonna be able to do that anymore so there’s no good choice in this situation for gradient descent I can make the learning rate high I diverge and make it low converge too slowly along these directions that curve very slowly so this can be described in theoretical terms which I’ll get into now so a a couple of common technical assumptions that we’re going to use throughout the lecture which have intuitive meanings which we’ll talk about are as follows the first is that we’re going to assume that HS Lipschitz continuous derivatives or aka Lipschitz smooth now what does this mean basically it means that the gradient doesn’t change too much as we change the parameters so a small change in the parameter translates to a small change in the gradient or in other words the function doesn’t have too much curvature so this L coefficient which tells us the relationship can be thought of as an upper bound on the curvature in the objective function and we’ll have a corresponding condition called strong convexity or of course we don’t necessarily need to assume this globally it’s good enough if it’s this is happening sort of locally around the area that we’re converging and in order to apply say for example to neural net objective functions which aren’t convex in general and this function this condition is given here and essentially it’s saying that the function curves is at least as much as this quadratic term which has a curvature of MU so this now gives us a lower bound on the curvature of the objective function we’re also going to assume for now that gradients are not stochastic I can get into stochastic convergence theory later but for now it’s simpler to think of it the gradients as computed deterministically that is without approximations so if you have those two conditions then this is the basic bound that you get with gradient descent so just to decode this a bit this this part here is the difference in the objective function value between where I am after K steps and where the optimum is and it goes down according to this function now this function depends on K up up here so as K gets bigger this exponent gets bigger but this wanted to use less than one so as this exponent gets bigger this gets smaller it gets smaller at an exponential rate and there’s some dependency on on how far you away from the initial point over here so the key quantity here is Kappa and that will determine sort of how much bigger than or sorry how much less than 1 this quantity actually is we prefer smaller values of Kappa Kappa is the ratio of the highest curvature to the lowest curvature and so this is what is often called the condition number although we’re taking a global definition of condition number so condition numbers usually describing a property of a matrix but you can also apply it to an overall optimization problem which is sort of the biggest Augen value globally for the Hessian everywhere divided by the smallest one and so perhaps it’s more useful to think of what this balances in terms the number of iterations that we need to obtain a particular optimal margin here that is to get to a an error between the optimal value and the current value of no more than Epsilon and that’s given by this expression so this is so K has to be roughly Kappa which is the condition number times log 1 over the air we like

to achieve so just some words about bounds are they useful in practice kind of there’s a lot of issues with them though so often they’re too pessimistic so they you know when you prove these things you have to consider all examples including worst case examples of course real problems aren’t necessarily worst case sometimes they make assumptions that are too strong say for example convexity now we don’t really need convexity to prove these kinds of things often at least if you’re looking at this sort of asymptotic behavior but it’s convenient and often times you can sort of describe what you’re doing in optimization theory you know you assume that your starting already close to the minimum close enough that the function looks comebacks oftentimes these theorems make assumptions that are too weak insofar as they don’t use enough structure from the real problem real problems have all sorts of interesting structure which are not captured necessarily by you know condition numbers or Lipschitz can l you no bounds things like this and you know these these bounds often rely on these crude measures such as condition numbers which are only sort of a vague description of what the function is actually doing finally and this perhaps the most important point is these bounds are usually focused on asymptotic performance so they don’t tell you necessarily or they don’t give you a reasonable idea of how fast you’re converging let’s say long before K which is the iteration number gets very large and in practice we often stop our optimizers before they converge so you actually do care how quickly you’re getting to that point pre asymptotically right so I would say in practice your choice of optimizer should be you know first and foremost informed by experience try different things but at the same time you know you can use theory to tell you at least help guide your choices in terms of different optimizers so these you know theory can develop intuitions which then translate into the real world and if you’re sort of nuanced enough that sort of that works out although not always be careful about anybody who says anything is optimal by the way so next I’ll talk about momentum methods this is sort of the easiest modification of gradient descent that we can make to have it perform better so the basic motivation is you know as we saw in the example of the the valley the two-dimensional valley gradient descent can flip back and forth when it’s when it’s when the learning rate is large so and of course if the learning rate is small you have the opposite problem that you don’t move along these low curvature directions say the base of the valley you don’t move along there fast enough so the key idea with momentum is that we’re going to try to accelerate along these directions of low curvature let’s say at the base of the valley and we’re gonna do this essentially following a physics like equation of how momentum works in in real physical situations so in particular you can think of the parameters like a ball rolling along the surface the surface is defined by the objective function and this ball is subject to the force of gravity so as it starts to roll along a direction let’s say the base of this valley here it accumulates momentum this is actually the best illustration that I could find on the Internet unfortunately because you can find a lot of videos of balls rolling down hills but what’s important here is that it’s hitting the sides of this valley right it goes up and then comes down and it’s that suppression of oscillation that is really the important thing that’s happening momentum right anybody can just move faster but if you’re if moving faster it causes you to go off the side of this hill you’re done essentially but here the ball is rolling in such a way that it’s not going over that and it’s stayin along this valley but like it should but at the same time it’s picking up speed as it goes down so that’s that’s the idea that we’re gonna try to exploit with momentum so here are the equations for momentum the basic version is that we have a velocity vector V and we’re just going to keep adding the gradient to it essentially well and also

decaying our current velocity by some factors and we’re gonna multiply that recurrent velocity by some factor this is you can think of as friction in a physical system so the velocity is not just preserved perfectly it sort of goes down over time but we’re always adding this force to it essentially and so and then once we have our velocity we just add that to the parameters times some learning rate as before there’s also a different version of that which I won’t really get into but it’s a slight tweak of the basic version that’s more useful in theory and sometimes in practice so returning to this 2d Valley example again we have once again this is the situation with gradient descent where you can bounce along the sides but eventually you bounce too much and you diverge or you take the learning rate that’s too slow or too low and now we’re not we don’t oscillate out-of-control but we move too slowly you can think of this as say in that video the ball if it never picked up any momentum it would just sort of crawl along you know after somebody had pushed it it would never actually get any faster but what momentum allows you to do is use you sort of you go you you hit the other side but immediately this vector here is this is remembered in terms of velocity so the ball has velocity going this way and then it hits this side and there’s something pushing it the other direction but that cancels out its initial velocity so actually the ends up just going straight and it comes down here and then it starts to roll back this way but again it’s canceled by the gradient which is pointing that way so it’s sort of never is able to oscillate out of control meanwhile there is one direction which is always pointing downhill and that’s this one and so velocity keeps accumulating in that direction and we haven’t and that and therefore we get to our goal in fewer steps so we can justify this stuff theoretically as well so given an objective function satisfying the same technical conditions as before that is we have this upper and lower bound on curvature across the objective function we have this bound this is very similar to the one before except now this term here depends on the square root of Kappa not just Kappa itself but otherwise it’s identical to what we had before and what you essentially get out of this is that the number of iterations needed to achieve a certain air is roughly this expression where we have log 1 over epsilon Bo as before but now there’s a square root in front of the capless so we’ve improved our dependency on this Kappa term by lowering it because of course this is a quantity that’s that’s greater than 1 and now we we’ve got a yeah we’ve got this better dependence so for problems where this is large this is gonna make more of a difference to use momentum so we can then ask is is this as good as we can do we’ve added velocity to mama to gradient descent you know could maybe we could sort of add some sort of acceleration term that’s also preserved some higher order effects well it turns out in some technical sense this is the best we can do so I’ll first start off by defining first order method technically so this is the term that gets thrown around a lot but there is a technical definition and this is the one that it was for example proposed by Nesterov although I’m pretty sure it goes back much further than that and it’s essentially that the difference between parameters or in other words these steps that we take these DS at any consecutive iteration is given by or is contained in rather the span of all previous gradients that we’ve seen so these are span just means the the set of linear combinations of these things so in other words we’ve added various multiples of previous gradients to each other to arrive at where we currently are that restricts you to a certain class of algorithms and but this is actually a pretty interesting class it includes for example gradient descent momentum methods as well also conjugate gradient methods which are typically only applied to quadratic problems although there are sort of nonlinear versions that also fall into this category so what in particular is not included in this definition are preconditioned or

second-order methods which we’re going to discuss later so given this definition of first-order method we can sit we can now ask well is you know is how well can you do with first-order methods and it turns out that we actually now have a matching lower bound so we had an up or down bound from before on the performance but now we have this lower bound which says that you cannot converge faster than this and this looks a lot like the term that we had before and in particular it requires that the number of iterations to converge is of this form and this is the same as the upper bound that we had for momentum so in some sense momentum methods are quote optimal of course this is only worst-case optimal and although caveat s’ that I gave before do apply but you know you at least know in the worst case that you know there is no major algorithmic improvement that you can make at least if you keep yourself inside of the class of first order methods yeah so just getting back to the are bound so far we have this worst case lower bound for first order methods this is the band that we got for gradient descent here we have the kappa term without the square root so this is a worse worse than we we can get and in practice gradient ascent does get this even though it’s an upper bound this is actually a fairly good you know it’s a tight upper bound for gradient descent and the upper bound for good gradient descent with Nessarose momentum is this so it match it matches the lower bound okay so that’s first order methods and next I’ll get into second order methods so so how are we doing for time sorry well we’ve got plenty of time so second order methods are sort of the next step in optimization methods beyond first order methods the big problem that we had with first order methods was this dependency on this sort of global condition number Kappa this in particular the number of iterations that we had scaled as some function of that so Kappa is the ratio of the maximum curvature again over the minimum curvature and for certain problems this is going to be very big say for example certain kinds of deep architectures although surprisingly networks like res Nets actually this number is really not that bad at all and that’s why for example you see people use regular gradient descent in resonance with a lot of success but there are certain kinds of models for example models that I’ve been exploring more recently people at deep mind have got into certain physics inspired neural networks that are harder to optimize and classically we had a bunch more of these networks around before people everybody started using resonance that are hard to optimize so so in practice you know this might matter it but it might not and it really depends on the problem but we’d like our optimization methods to be as robust as possible not just break down if our problems become too hard in some sense so this is worth trying to improve the situation and second-order methods do allow us to do this they allow us to improve or sometimes even eliminate the dependency on Kappa and we get similar bounds but now the Kappa term is vanished so the the basic idea with second order methods is to essentially return to the approximation idea that we had before for first-order methods so we were going to locally approximate our objective function by a simpler function now before we had a linear function which was a straight line now we’re going to replace it that with a quadratic function and the easiest way to do that although not necessarily the best way is to take the second order Taylor series around the current point so the second under Taylor series is locally anyway the most accurate quadratic approximation you can make to the function and if you were to minimize this approximation to the objective function of with respect to D you get an update of this form which is the negative of the Hessian inverse times the gradient so it requires you to compute this Hessian matrix and take its

inverse and multiply that by the gradient and then the basic update iteration is the same as before for gradient descent you can also augment this type of equation with momentum as well and that can sometimes give you an additional boost it really depends on in some sense how well you’re approximating H and and and if you’re you know if you’re doing a perfect second-order method in some sense momentum is not going to help you because you’ve already eliminated the dependence on the condition number but if you’ve only improved it a bit you could still get an additional boost from momentum but we’re just gonna assume that we’re not using momentum for the purposes at least of the theoretical discussion of second-order methods although in practice people use momentum quite a bit with second-order methods so now we can return once again to this example so here we had gradient descent I’ve just shown the picture for the small learning rate which is the one that doesn’t diverge momentum was able to you know help us get around this oscillation issue without sacrificing our ability to move fast along the the base of the valley second order methods are you know quite elegant in this that they just model this curvature so a second-order method actually sees that the both sides of the the valley curve upwards like this models that and it goes straight to the bottom and then once it gets here it sees oh this is actually a very very sort of smooth pathway that and then in other words it’s quite flat and it’s going downhill at a reasonable rate so I can just instantly accelerate in that direction so right so another way to think about the relationship between gradient descents and second-order methods is to think of gradient descent as a kind of primitive second-order method so when you’re doing gradient descent the maximum allowable learning rate is one is up to a fudge constant 1 over L where L is quantifying the maximum curvature like before and so given this learning rate which is the maximum when the one that we can tolerate you can think then a gradient descent as like a second-order method where we start out by proposing to use the second or Taylor series approximation for the objective function and then minimizing over that but then we because we don’t like the session term it’s too complicated we substitute that with L times the identity matrix so when you do that substitution you’re essentially replacing the Hessian with a term that says the curvature is maximum everywhere as opposed to trying to distinguish between directions that have high curvature let’s say the sides of the valley and the directions that have low curvature in other words the base of the valley so all the all directions are treated as having this maximum curvature and when you do that well you don’t then you don’t get to see that the base of the valley is actually quite smooth and in a good direction to accelerate in you just have to move slowly in all directions and so L I in some sense is it’s just too pessimistic it’s of and it’s too crude of an approximation for this sort of our altered second-order method to perform well so signature methods sound great but there’s a lot of catches and the first one which is actually pretty easy to handle but is very important and often overlooked almost criminally so is that this idea of using approximations well it has the same problems that gradient descent has but in some sense you need more machinery to deal with them because the approximations are in some sense pushing you further so what I mean by that well if you if you think of again this think of this example here where are the Purple Line is the true objective function and over here we might take the second order Taylor series and this is a good approximation to the function locally but a second order you know approximation could go off like that it could be wildly inaccurate as we move away from from our current point so you know because gradient descent in some sense is taking the maximum curvature it can never actually go to wrong when it comes to it’s it’s you know the minimization of its implicit second order approximation but if you’re using the Hessian you can go wrong because a direction that might start out as having

low curvature if you go too far in that direction it might start to curve upwards abruptly because again you’re your law service is not perfectly quadratic so this kind of thing can happen just you know say for example in the valley you know we could have right at the you know close to the bottom of the hill you could suddenly have this ramp that goes up and you don’t see that locally until you get close enough so we can’t move too fast with second-order methods and the key idea then is to restrict our updates into a region around the current point but unlike our methods you know actually worth you know doing implementing this is gonna be a bit trickier in practice so how does this look well you can start out by defining again a minimization problem over the quadratic approximation but restricted to some region and it’s usually convenient to take a region which is essentially a ball around zero for our for our update vector D of radius say lowercase R now it turns out that in many situations although there’s sort of technical conditions that have to be observed but usually this problem becomes equivalent to a problem where we’re just minimizing globally over a new quadratic but where we’ve added this multiple of the identity to the curvature matrix or the Hessian here and so of course we know how to solve this problem that’s just the inverse of the matrix times the vector times negative one so okay that’s true for some lambda now actually working out lambda can be tricky but we don’t really need to do that in practice we can just at least you know we don’t have to worry that about R and its relationship with lambda and talk about talking talking about a sort of each step can be computing a lambda for our given R we can just work with lambda directly we can just say I’m adding this value of lambda maybe it’s too big maybe it’s too small and there are ways that you can adjust this in practice various heuristics that are inspired by algorithmic works that you know that people often use so I for example use a method called eleven bookmark card method which allows you to sort of adjust lambda on-the-fly so there’s another thing about second order methods which is sort of important to talk about which is that the Hessian might not actually be the best matrix and this I think a lot of people find really counterintuitive and this comes up a lot in neural network research where nobody uses the Hessian even if you could even if somebody gave you an Oracle to compute the inverse Hessian you wouldn’t necessarily want to use it and it’s kind of hard to understand that but I think it’s worth thinking about you know what what makes a good quadratic approximation to the objective function right I mean the the Taylor series the second order Taylor series is locally optimal right in the sense that it gives the most faithful approximation of the lost surface in a sort of a small vicinity of the current point but I mean that’s not what you want so say for example you might want an approximation that gives you sort of a more global view of the objective function so again here purple is the objective function this might be our second order Taylor series approximation but there is a different approximation which kind of gives you a better global view and if I was to minimize this approximation had actually been doing much better even though it’s not necessarily that accurate out here it’s still bringing me to the sort of the right rough area so in some sense it’s capturing more of a global structure in the objective function I might also want my approximation to be more conservative so say you know for example here this is the same example on the previous slide where we had we’re talking about there the trust regions now orange being our Taylor series approximation but this green one here if we were to minimize this one we’d get over here if we were to minimize the orange one well we’d get out here somewhere and of course the objective function is curved up long before that happens so if you were to move out here you’d essentially you’d be you know the objective function now values now shot up to infinity or something and that’s no good so there are definitely situations where you might want to use a different

quadratic approximation to the objective other than the second order Taylor series and when we find this in practice so the most important family of examples that that myself and others have found for neural networks are the generalized gauss newton matrix the Fisher information matrix which is often related to the first one there are in fact often equivalent for certain kinds of losses and there’s also the empirical Fisher which is a kind of a weird approximation of the Fisher information matrix it’s cheap to compute but mathematically it’s a bit dubious so some nice properties of these particular matrices versus say the normal Hessian well first they’re always positive semi definite so there’s no negative curvature in the matrix itself now that’s good because of course if you have negative curvature and a quadratic approximation that is kind of telling you that you can go infinitely far of course if you restrict yourself to a trust region you solve that problem kind of but it is nice to have an approximation which just even without the application of trust regions gives you a minimization problem that actually has a reasonable minimum they and also you get some you you open yourself up to a wider class of theoretical results if you can assume that your matrix that you’re multiplying the gradient by is positive semi-definite another interesting fact is that in fact if you take small enough steps that is you make the learning rate small enough you are invariant to any Reaper is a ssin of the objective function if you use one of their at least the first two actually know yeah all three of these matrices have that property so many people will know that Newton’s method is invariant to linear Reaper motors ations of the problem but these method methods based on these matrices actually are invariant to any smooth repair motorisation if you take small enough steps and that’s just not true of the well it really depends on how small you mean but that happens much faster anyway for these methods and finally and this is just an empirical fact it works better in practice for neural nets and there isn’t a total comprehensive understanding as to why that that’s true I like to think that some of the intuitions given on the previous slide and these observations are important but nobody has a fully comprehensive story yet about this so I’ve gone over sort of the common problems with second-order methods and in these old ways you can change them new matrices you can use but there is a you know a huge elephant in the room the second-order methods which is just that these matrices the Hessian or one of these alternatives that we might want to compute are huge so in for neural networks for example you know that the dirt the prim dimension should say of the parameters can be in the tens of millions so that means now that we have a 10 million by 10 million matrix say that we have to compute we have to store it and then we have to actually invert it and that’s just totally out of the question as n gets into those ranges so the common solution and the one that we’re going to use we’re going to talk about in this lecture anyway are approximations at the objective the matrix itself although there are I should point out though that there are sort of a different class of methods which instead of approximating the matrix they just approximate the problem of minimizing the quadratic so they don’t perform an exact minimization and therefore they don’t need to compute an inverse but though methods have sort of become less popular in recent times and so approximating the matrix is sort of the easiest and most effective thing you can do so the first approximation I’m going to talk about our diagonal approximations and these are the absolute simplest things so what you do is you just take the matrix that you have and zero out all the non diagonal entries so

inversion and storage super easy right because now you just have n entries and you need to invert a diagonal matrix you just take the reciprocal of each entry so that’s trivial computing these matrices actually it’s slightly non-trivial but it really depends now we’re getting back to some of the different choices the Hessian the gauss newton the fischer depending on which one you choose there can be different computational costs associated although I should say that for any of those choices there are good approximation algorithms that will get you the diagonal but not exactly but for the empirical Fischer actually it is quite cheap to get it exactly so now of course the obvious problems with the obvious problem with this method is that it’s a very primitive approximation and it’s really not going to give you anything unless there are obvious sort of access aligned scaling issues and so what do I mean by that well if you think of the 2d value example again you know if one of those directions say that that you know the high curvature ones that goes on that sort of hits either side of the the valley if that’s one parameter and then the other parameter is moving exactly along the base of the valley well that would be the perfect situation for diagonal methods cuz they could they could you know you could you in fact the true curvature is diagonal in that situation but in general you don’t have that in general different directions of curvature different eigenvectors of the hessian or whatever matrix you happen to be using are not going to be aligned with the coordinate axes and so the the matrix itself in particular is not going to be diagonal and the consequences of that can be severe and I’ll sort of erase any advantage you might get from using second-order methods nonetheless they are pretty popular so if you take the square root of the empirical fisher which is a slight fudge to the algorithm I view it as a way of sort of compensating for the crappiness of the diagonal approximation and therefore sort of hedging your bets by being more conservative you get RMS proper Adam we’re at which are actually quite popular optimization algorithms to use for neural nets now one step above diagonal methods are block diagonal methods so block diagonal method instead of zeroing out all the non diagonal entries we’re just going to organize our matrix into our parameters I should say in two sort of groups and then each group is represented by a full matrix but relationships between different groups are not are not modeled in our matrix and so we zero out all those entries so in a neural net blocks could be say for example all the weights associated with one particular layer or one particular neuron and those will give rise to different block diagonal approximation schemes so these are still fairly cheap depending on how big your block is the storage cost is just number sort of the B here is the block size I’ll assume just for the for simplicity that all blocks are the same size so this would just give you a storage cost which is B times n so you’ve only increased your your storage cost over diagonal methods by a factor B the inversion cost is B squared times n so that’s quite a bit worse than a diagonal case but again if B is not too big that might be reasonable it’s all it’s basically just as difficult to compute this once you get around the you know the additional storage versus the diagonal case but like I said it can only really be applied in the case where B is small and let’s say if your blocks are the parameters for an entire layer well that’s still millions of parameters sometimes and that might just be way too big to deal with so one method which is probably the the best at this is is something called Tonga although to be frank block Daigle methods in their raw state haven’t really been popular for many years but this is sort of this is sort of the go to work on that now one way you could improve block Daigo methods are so called Kronecker product methods they’re Kronecker product approximations so if we start out with a block diagonal approximation of the generalize gas Newton or the Fisher

where your blocks are corresponding to whole layers which are like I was saying before too big to be treated naively and then you’re to further approximate those blocks with the special algebraic form which is called a chronica product then you get this approximation so what is a chronica product a Kronecker product is well it’s it’s it’s denoted like this eight a times C and in terms of act the actual matrix that you get it’s essentially created by taking multiple multiple copies of C and tiling them over and over and over again and you tile them once for every entry of a so you create this much much bigger matrix out of two small matrices and that seems like an arbitrary construction but actually it’s sort of a Rises very naturally when you start thinking about neural nets and approximations although I don’t have time to get into that exactly how that happens but you does and what it allows you to do is actually do much better in terms of storage and computation now this is a type of this is not oven I don’t know why I wrote that must have copied and pasted it pasted it from a slide but it’s not it is more expensive expensive to store these matrices approximations then a simple diagonal approximation is but it’s not that much more so the cost of applying these okay I see why I wrote that it there are some circumstances where that might be true but there are also some circumstances where this is sort of not really accurate it’s too difficult to sort of get into because that you have to sort of get a more fine grained analysis but you can think of it as most of the time being roughly the same now the cost to apply an inverse is well it’s B to the 1/2 times n so that’s just a little bit more expensive than a diagonal matrix approximation again B being the number of parameters let’s say in an entire layer so this could still be you know well B to the 1/2 let’s say if you’re if you have got million a million parameters and in a single layer you know you still have a thousand factor here so it’s not nothing and this gives rise to what I would argue is the most powerful neural net optimizer k faq its most powerful it’s also a little bit heavyweight but it does optimize difficult and that’s the most effectively and so finally I’m going to talk about stochastic methods and so throughout this lecture I’ve been talking about deterministic optimization methods mostly because it’s just easier to talk about them the theory is nicer and a lot of the intuitions that you build when you consider deterministic it’s apply in the stochastic case partly because well if you take a mini batch large enough a stochastic methods sort of looks like a deterministic method but I’m getting ahead of myself I haven’t even defined that yet what a mini batches so a so you know a typical training objective which we saw before consists of a sum of okay now this really is a typo these two should be reversed right so this is a typical objective function which is an average of a bunch of individual losses for each let’s say each training case although in general you know there can be other ways that you can get this kind of form arising in machine learning and so that means that our object our gradient is the sum or the average of these individual gradients and well if M is very large which m being the size of our data set this computation could be just way too expensive to always run and so the idea with stochastic methods is that we’re going to well observe that you know these eight these individual objective functions from each training case are not totally independent right there you know they say for example involve a neural network learning how to make a prediction not every single training case is totally different from every other training case there’s a lot of overlap in terms you know the tasks

that you’re trying to solve and so and this is especially true early in learning so you can imagine in a neural network you know the first thing that it has to learn are the basic properties of the data set the simple sort of statistics of the images let’s say in terms of their means and their variances and then maybe it starts to distinguish between sort of course categories like cat and dog but it hasn’t yet sort of learned all about the flying distinctions between different breeds of dog so is often you know easier at the beginning in this sense and therefore the and the the overlap between different cases is sort of stronger you know in other words the the cases are telling you not you know they’re not that fine-grained information is not as important and that intuition really does carry through you do see this in neural net optimization that stochastic methods when you start optimizing they behave very much like deterministic methods so in this in this sort of in this correspondence degrades over time as you as you converge as you start to converge so the idea was with stochastic methods is that we’re going to yeah we’re just going to take a subset of the training set so we’re not going to take all mm cases we’re going to sample some random s and they just average over these B being the size of the set s so this gives us some kind of stochastic approximation of the gradient and in fact it’s a unbiased stochastic approximation so stochastic gradient descent is then defined just like gradient descent was but we have our stochastic gradient in place of the true gradient and this method right off the bat actually just won’t work precisely it’s not even not even going to converge unless we do one of the following things so one thing you can do is you can decay the learning rate and there are specific ways that you specific formulas that you could use here where you know the value go essentially goes to 0 as K grows this is a form that sort of elegant and works well in theory and you prove theorems with this kind of formula in practice there are better formulas that you can pick but this at least is sort of a simple baseline and every formula that you are gonna pick is sort of going to be roughly inspired by this one now and this is are getting back to my hold you know theory versus practice you know there’s one thing that the theory says you should do and practice often you know by exploiting additional properties you can do better so another perhaps better I would argue better alternative is polyak averaging so this involves taking an average of all the parameter values that we visited in the past seems kind of like a silly thing to do because you know the initial parameter value might actually just you know it’s just our random starting point which is not particularly significant at all you know but it right but but it’s nonetheless you know as you start to take an average over more and more things you know the dependence on that point fades you could encourage that to happen by taking a kind of a decayed average an exponentially decaying average so this is a type of average which decays faster than a normal average does in terms of its dependency on the starting point but it’s the the theoretical things you can say aren’t quite as good or at least the theory isn’t quite as elegant for this case but this is what people do in practice and it works better so this will allow your stochastic method to converge another thing you can do is you can actually just increase the mini batch size during optimization and if you do this sufficiently quickly people have shown that that actually gives conversions as well so there’s a bunch of options here and oftentimes the best thing to do comes down to you know really just running the experiment and trying different things until we at least until we have a much better theory for this kind of stuff so stochastic methods in general are going to converge slower than their non stochastic counterparts and this is kind of obvious I mean you’re just taking the gradient and replacing it with some kind of noisy approximation so you’re basically just taking

a good algorithm in your cropping it with sort of noisy data but it’s not that bad so this term or sorry this this formula is what you get if you do stochastic gradient descent with polyak averaging so I haven’t defined this matrix but this is just you can think of this as the covariance matrix for the gradient estimates because again our gradient estimates are stochastic quantities so they have a covariance matrix and you can just compute that and if you multiply this by the inverse Hessian at the optimum take the trace multiply one by one over K and then you add some higher you know order terms which are going to decay faster than one over K because again this is constant the only dependence on K the iteration number is over here so these terms well they can matter but this is the asymptotically dominant term and that gives you a asymptotic convergence which for sufficiently large K is essentially going to scale like this so to get our epsilon error this is the form that we get and you’ll notice that there’s no log here so before with deterministic methods we had the the log here here there’s no log so this actually this dependency is much worse and you can think of again you can think of essentially in a stochastic method the air is going down as 1 over K right 1 over K in our deterministic methods they were it was going as a exponential function of K so so this is much worse in some sense but actually these methods do quite well in practice and part of the part of that comes down to various ways that you can mitigate this sure make this term small and actually these terms end up being larger than you think and so oftentimes maybe your true optimization is dominated by this but it won’t be true asymptotically not an interesting thing that I think is worth pointing out is that it’s been shown that this is as good as you can do so asymptotically this form is as good as any algorithm can do ever and essentially the way you prove that is by just arguing that if you have only seen a certain amount of data there’s an intrinsic uncertainty in your parameters right you don’t know what the true value is because you literally there’s no way to disambiguate it given what you’ve seen so far and that intrinsic uncertainty is the kind of error that you would get with with this term here so so SUV with polyak averaging is actually optimal in a very very very strong sense but it’s only asymptotically optimal and again asymptotics are not always the whole story so you could apply second-order methods with stochastic gradients and people do and you know so the there are sort of tricks of the trade to make this work now when we’re computing curvature these these you know these Hessians or these Gauss Newton matrices you know we have we have the same problem that we have with computing the gradient right we don’t want to compute it over the entire dataset that might be too expensive so the common thing done in practice is to take take a decayed average that’s just where again you have a running value that you sort of updates as you go and it’s in its approximate self as an approximation but it’s often good enough so it is worth pointing out though that you know based on the discussion that I just gave on the previous slide there is no faster method than SGD with polyak averaging so asymptotically we cannot hope to get any advantage out of doing this but pre asymptotically yes it matters and just going back to the slide preview this term which hides a lot of dependency in which for example unlike this term you know this couldn’t depend on things like the condition number and and rather the pre-conditioner that you use I shouldn’t say condition number so that the any improvement you make to the condition number might be reflected here it won’t be reflected here this term does not depend you get the same expression even if you use a second order method and there’s no improvement so this term literally does not depend

on the Hessian or whatever matrix you end up picking but this term does and this term can be improved so when would you expect this to help in practice well if the the law surface is if the curvature is bad enough in other words for example the you know the condition number might be very big although condition number is only one measure of sort of badness the or the mini-batch is large enough so if the mini-batch is big you’re naturally gonna have a low variance in other words this term here whoops this term here is gonna be small and then this whole thing doesn’t matter as much as it did before in your in terms of your total error so those are two ways that it can still help and if you have a combination of these two things going on then there could there can be an advantage and in fact this is a graph that was produced very recently which I feel was sort of the ultimate vindication of this kind of research so people these days train resonates if they’re doing deep image classification almost exclusively but you can consider networks that are say 100 layers deep but don’t have skip connections don’t have bachelor-man have the usual tricks and this gives rise to a much harder optimization problem because it turns out those tricks are actually helping make the optimization problem easy a much easier so if you if you have such a network and you initialize it carefully then let’s say and you pick a batch size which is you know not crazy in fact this similar result holds for a much smaller batch size say 64 then in fact there can be a huge advantage to using a second order method like K FAC versus momentum methods or atom which is a popular diagonal second-order method but it’s interesting to note that if you to run the same experiment on a resident the differences vanished completely so all methods perform almost identically and that’s what you expect from the theory if the condition implicit condition number of a ResNet was really really good and it seems to be it’s good enough anyway that the asymptotics predicted before sort of are the dominant factor and so then it really only matters how much data you’ve seen yep yes yeah so it is it is typically slower but it it really it really depends on what kind of second order method you’re talking about so diagonal methods are almost you know no cost it would be about half the speed I would say but it depends on how you optimize it so people have done work on mitigating these overheads and it can go down you know to like 10% slower for example depending on the different trade-offs so I it because those things always depend on implementation details I tend to not talk about that but yeah you can get these overheads down quite a bit and this you know this difference by the way will never be made up by like a 2x like this graph will go out really really far it’ll almost never catch up to that in fact these networks are basically impossible to optimize with first order methods at least to the same level so so yeah so so so these methods can make a difference in certain kinds of networks but there is this sort of tension in the community between making the networks easier to optimize and just making the optimization technology better both are solute solving the same problem in some sense but it’s nice to have more than one solution and I hope that you know by embracing these more powerful methods that might open up new classes of models that people wouldn’t necessarily been have been able to optimize before so now I’ll just go over some wrap-up and conclusions for their lectures so I talked about optimization methods and how they’re important in machine learning they work by adapting the parameters to minimize an objective function and they’re the main engine behind neural network learning we talked about second-order method first order methods such as gradient descent the key interpretation being is a steepest descent method or also as a kind of a first-order approximation that that you minimize locally we saw how this can run into some issues when the curvature

varies in two different directions let’s say the base of the valley versus the sides of the valley we talked about momentum methods which allow us to accelerate along these directions of low curvature let’s say the base of the valley and in fact are optimal in a certain asymptotic sense amongst any first order method you could propose then we got into second-order methods we talked about how these can improve problems associated with bad curvature how they can eliminate this dependency on the condition number or at least improve it although coming with a whole bunch of caveats so for example that you need to use stress regions or damping methods for them to work well you have to consider alternative curvature approximations there’s alternative curvature matrices and then you also have to talk about approximations of those for this to be practical in neural net training for example finally we talked about stochastic methods which use mini batches of data to estimate gradients and possibly curvature we saw how these are sort of asked product ly slower than deterministic methods but how they’re pre asymptotic performance could in principle be sped up with the use of second-order methods and we saw an example of that in practice and that is the end of the talk so I think we’re doing pretty good for time I was about 10 minutes early so I’m happy to answer any questions you might all have and I also have some references at the end here in case you’re interested in learning more yep yeah so oh should I repeat the question okay so you were asking about initialization methods what are the optimal ways of doing that so I would say initialization is a topic that’s picking up a lot of steam recently it’s been something that I think was brushed aside for years and now you’ve got a bunch of papers coming out that are tackling this the initialization method that I talked about in that slide when I said careful initialization that’s been sort of my two-year long epic project so it’s actually very very hard to initialize a deep network like that and have a train as fast as a regular Network well like a writer when I say regular I mean a resident there are you couldn’t get almost arbitrarily complex with this stuff it’s a very deep subject and I think the most exciting results are going to come out this year people you know if you’re using something a package like tensor flow you know you’re using a default initializer typically the typical rule is to is to take a Gaussian and multiply by 1 over square root F where F is the fan-in factor for that layer and that in almost every initialization is starts from that you know that basic point but there’s much more than you can do beyond that point yeah it so it’s it’s it’s important it’s very important in particular you know if you’re solving you know if you’re if you’re trying to get rid of things like batch normalization or skip connections you know you can use a better optimizer to solve the optimization problem associated with doing that but you still have to solve the initialization problem because those methods also as it turns out were fixing bad initializations so if you fix both the optimization and the initialization then you can get rid of those things by analytic you mean you can compute them sort of locally if you know like there’s an actual formula for them yeah yeah they are so they’re not like quasi Newton methods because quasi Newton methods of course depend on this history of iterates yeah I know these these algorithm these matrices are sort of well-defined at every point the only reason you’d ever accumulate data to compute them is just because you want to get a more cystic livre bust estimate of them but this in principle you could just throw out your old estimate and computer an entirely fresh estimate

where you are yep yeah I mean it’s complex you know there are different ideas out there one idea that I find compelling which is a paper recently published from deepmind is is that the skip connections are in some sense making the the network look like much more like a shallow network and shale networks are sort of easier to optimize and that you can sort of slowly recruit more non-linearity into your model as you as you go instead of the skip connection plus batch norm architecture enables this it you know you’re it’s good you’ll be hard-pressed to find a sort of a fully rigorous story here although I think the you know the story it things are moving along in such a way that what I just said it is almost certainly true spiritually anyway the the initialization method for example that I’ve been working on follows a similar sort of principle in the sense that it makes the network start out very linear looking and then sort of out but in such a in a careful way that allows it to sort of become more non-linear gradually so you can actually do this analysis based on kernel theory where are you so you can actually really see with high probability what a neural network will do if you keep adding layers on top and what you quickly observe is that the sort of the way that the neural network sort of maps the input to to its output is a function that sort of degenerates very very fast unless you’re extremely careful with how you set the weights at each layer and the burden of having to do that having to set those weights carefully becomes harder and harder and harder as you keep adding more layers your algorithm has to be more and more precise so using default initializations which are quite primitive you know we get away with training shallow networks no problem in a resonant because it looks shallow enough it’s in some sense it’s it’s hiding all this depth it those naive initialization methods are good enough for a resonant but if you yeah if you can solve the optimization problem associated with deeper networks then these differences are in some sense not important once you use a good enough initialization we can train so one thing I didn’t plot on this slide here is that if I was to plot a regular ResNet it would follow this orange curve exactly so we can get these we can get the same optimization performance now without without the resonant architecture but you have to do a lot yep yeah well so you know the community’s gone a bit back and forth on you know the importance of stochastic gradients in terms of you know adding regularization into the problem it’s certainly true that it does that’s sort of undisputed but there’s some disagreement over how important this effect is if you look at you know the the modern convergence theory for deep nets which actually predicts that the loss surface is essentially a convex quadratic within the vicinity of a good initialization point so really all this theory that I talked about actually is applicable because really the function is more or less convex at least in the neighborhood that you care about not only is it convex its quadratic at least if you use a certain type of loss and yeah so that theory sorry now I got I lost your question again was about oh yeah oh yeah yeah regularization yeah yeah so if you know if if you are in that situation where the where the objective looks like a convex quadratic you know really there’s there’s only one minimum that you’re ever gonna get to right so if you’re stochastic method converges it’s gonna find the same solution as your deterministic method would have now of course stochastic methods don’t often converge or they’re not taken to converge then you could sort of ask well how good is my unconverted maybe maybe the lack of convergence is itself a form of regularization probably true but not a very big source so you know these kinds of results that I’m plotting here you do lose a little bit on the test set but it’s it’s a few percent and you know I would like to believe that you can make up for those differences say by adding more explicit regularization to the problem it’s a bit dubious in my opinion to rely on the OP

to perform the regularization that sort of classically been the job of the objective function right the optimizers job should just be to optimize yep yeah yeah that’s very hard I don’t think anybody’s really come up with a easy way to do that yeah yeah I mean you could try to measure the condition number or some related quantity I would argue the condition number is sort of too primitive of a quantity to really tell you that much because for example you don’t care about the minimum curvature really that the function might be totally flat along certain dimensions and that doesn’t matter at all right so you your condition number could be infinity but the sort of the the true condition number which is the quantity that only cares about directions that actually matter in terms of the error value might be much bigger than you know it much might be much smaller than that I should say smaller than infinity so yeah so condition number is problematic right off the bat you could try to you know start computing the eigenvalues of the Hessian but you know then you get into the problem that those values might are you know are describing what’s happening right where you’re initializing but maybe you know the curvature evolves as you optimize and then it’s hard to predict how it’s gonna look you know halfway through the optimization until you actually have run half of an optimization but by that point you’ve already starting to do this empirical evaluation of your method so yeah no I I think you couldn’t you know you can use intuitions you can say well deep network the deep the deeper the network is the harder it is typically you know unless you’re using skip connections in batch norm and then you sort of mitigate that oran ends are harder than feed-forward nets typically because they basically look like very deep nets without skip connections so there’s there’s intuition so you can apply but in general it’s yeah I don’t have a good answer for that yeah this one sorry between the between H and between these ones well yeah this is the covariance of the gradients and this is the Hessian it’s interesting that in fact sometimes those matrices will be equivalent that actually does happen in certain types of problems under certain technical conditions so in fact this term can sometimes become trace of identity but in general that two matrices will be different well I mean so that the covariance of the gradients is kind of like the empirical Fisher and the Pyrrhic of Fisher is kind of like the Fisher which is kind of like the gust Newton which is an approximation of the Hessian so yes kind of there are all sorts of situations where that will be a bad approximation where you can show that it’s a bad one like so bad in fact that it’ll cause the algorithm to sort of very reliably fail nonetheless there are situations where it’s good and it really comes down to the particulars don’t clash more questions um I’ll go over there you were next yeah well I mean I think I think these intuitions that we get from lower dimensional cases have proved useful you know in the case of you know if you’re talking about difference of curvature all you need are two dimensions and you can start meaningfully talk talking about differences of curvature and then your then your problem just scales you know you just keep adding more eigenvectors into your Hessian and it’s sort of like it’s one of those cases where I really do think you get a lot from that from the low dimensional case and in terms of minima well you know a lot for a long time people thought you

know that the neural net surface must be really really crazy and it is if the network is badly initialized and it’s very deep and it’s called it can develop all sorts of pathologies but if you actually look at the lost surface of a resident for example or one of these Nets that I’ve talked about to a lesser extent anyway it’s actually quite good and the theory for that’s emerging now from the sort of this neural nets as GPS literature which is basically looking at how what happens when you make the layers infinitely wide is that they actually predict if you start close to a good initialization that locally in fact there is a quadratic looking ball with a minimum and that minimum is the global minimum it gets zero error and it’s not so hard to you know see that because why it would be zero error because you know of course if you’ve got infinitely wide layers well of course of course they can get zero error right they can just memorize the problem yep yeah there can be so so mini batch sizes as you get as it gets bigger we’ll make the stashed a stochastic gradient you know better estimated so in other words the variance goes down and when you’re optimizing you know the reason that you in stochastic methods the reason you decay the learning rate or you use polyak averaging is to sort of cancel out this variance and so if the variance is lower you don’t need to lower the learning rate as aggressively because there’s just less of it right so the one rule of thumb that people sometimes use is you know you they’re that they’re inversely related so you know you if you double the batch size you might be able to double the learning rate but that’s not always going to be true I mean it’s certainly not true of a deterministic method that you can go you know your learning rate can be infinity right which is what it would be if you just keep doubling the batch size and applying that rule naively but that rule of thumb holds and sort of a certain range of learning rates in batch sizes yeah that’s that’s that’s good insight and people have developed algorithms that do that that you know one way you can view it is you know you’re you’re sampling random cases in the mini batch might not actually be the statistically smartest way to get grading estimates right I mean it’s unbiased but it’s not necessarily the lowest variance for example and and yeah there and there has been algorithms and and corresponding theory to describe that or variations on what you’ve just talked about I would say those algorithms are not used much in practice partly because it’s actually kind of awkward to do that that kind of data processing because you have to look at sort of how those data points are hand you know what the model thinks about them so you have to run evaluations and the data pipelines the way that they’re written is that you know they’re they’re always like pre loading data and pre-processing it with all these threads in the background and so adding is too much complexity to that can can be detrimental I think it’s an under explored area though there might be potentials for a big gain there but it’s so for example the stochastic gradient you know theory that I gave right it wouldn’t apply in that situation because in principle well at least it applies insofar as if you could phrase what you’re doing is a variance reduction technique then it would apply but you know if you if you’re able to get magically some gradient that might actually have zero variance you know let’s say because you just took the whole data set right then all that Theory stops immediately working but yeah engineering wise it’s tricky which is why we don’t typically do those we don’t we haven’t explored those methods that much anyway I’ve been told that we should stop but if you want to if you have any more questions you can just

come down here I’ll be available for the next twenty minutes you you