Neural Network Sketch
Alright, so we’re now in a great position to talk about what the network part of the neural network is about. The idea is that we can construct, using exactly these kinds of sigmoid units, a chain of relationships between the input layer, which is the different components of x, and the output, y. The way this is going to happen is that there are other layers of units in between, each one computing the weighted sum, sigmoided, of the layer before it. These other layers of units are often referred to as hidden layers, because you can see the inputs and you can see the outputs, but this other stuff is less constrained, or indirectly constrained. And what’s happening is that each of these units is running exactly that kind of computation: take the weights, multiply by the things coming into it, put it through the sigmoid, and that’s your activation, that’s your output. What’s cool about this is that, in the case where all of these are sigmoid units, this mapping from input to output is differentiable in terms of the weights. By saying the whole thing is differentiable, what I’m saying is that we can figure out, for any given weight in the network, how moving it up or down a little bit is going to change the mapping from inputs to outputs. So we can move all those weights in the direction of producing something more like the output that we want, even though there are all these sort of crazy nonlinearities in between. And so this leads to an idea called backpropagation, which is really just, at its heart, a computationally beneficial organization of the chain rule. We’re just computing the derivatives with respect to all the different weights in the network, all in one convenient way that has this lovely interpretation of having information flowing from the inputs to the outputs, and then error information flowing back from the outputs towards the inputs, and that tells you how to compute all the derivatives.
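To make the forward flow of information and the backward flow of error concrete, here is a minimal sketch in plain Python. It assumes things the lecture does not specify: a tiny network with two inputs, one hidden layer of two sigmoid units, one sigmoid output, no bias terms, and half the squared error on a single example. The weights, input, target, and learning rate are made-up illustration values, and all function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W_hidden, w_out):
    """Each unit applies the sigmoid to the weighted sum of the layer before it."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    y = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, y

def error(x, target, W_hidden, w_out):
    """Half the squared error on one example (an assumed error function)."""
    _, y = forward(x, W_hidden, w_out)
    return 0.5 * (target - y) ** 2

def backprop(x, target, W_hidden, w_out):
    """Chain rule, organized so that error flows from the output back to the inputs."""
    hidden, y = forward(x, W_hidden, w_out)
    # Error signal at the output unit: dE/dy * dy/da, where dy/da = y * (1 - y).
    delta_out = (y - target) * y * (1.0 - y)
    grad_out = [delta_out * h for h in hidden]
    # Pass the output error back through the output weights to each hidden unit.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out

# Made-up illustration values.
x, target = [1.0, 0.5], 1.0
W_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.5, -0.3]

grad_hidden, grad_out = backprop(x, target, W_hidden, w_out)

# Sanity check: nudge one weight up and down a little and compare slopes.
eps = 1e-5
W_plus = [row[:] for row in W_hidden]
W_minus = [row[:] for row in W_hidden]
W_plus[0][1] += eps
W_minus[0][1] -= eps
numeric = (error(x, target, W_plus, w_out) - error(x, target, W_minus, w_out)) / (2 * eps)

# One small step against the gradient for every weight in the network.
lr = 0.5
new_W_hidden = [[w - lr * g for w, g in zip(w_row, g_row)]
                for w_row, g_row in zip(W_hidden, grad_hidden)]
new_w_out = [w - lr * g for w, g in zip(w_out, grad_out)]
before = error(x, target, W_hidden, w_out)
after = error(x, target, new_W_hidden, new_w_out)

print("backprop vs numeric slope:", grad_hidden[0][1], numeric)
print("error before and after one step:", before, after)
```

The numeric slope is the "move it up or down a little bit" idea done by brute force, one weight at a time. The backward pass gets the same numbers for every weight at once, which is the computational benefit being described.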
Those derivatives, in turn, tell you how to make all the weight updates so that the network produces something more like what you wanted it to produce. So this is where learning is actually taking place, and it’s really neat! You know, backpropagation refers to the fact that the errors are flowing backwards. Sometimes it is even called error backpropagation.>>Nice. So here’s a question for you, Michael. What happens if I replace the sigmoid units with some other function, and let’s say that function is also differentiable?>>Well, if it’s differentiable, then we can still do this basic kind of trick that says we can compute derivatives, and therefore we can move weights around to try to get the network to produce what we want it to produce.>>Hmm. That’s a big win. Does it still act like a perceptron?>>Well, even this doesn’t act exactly like a perceptron, right? It’s really just analogous to a perceptron, because we’re not really doing the hard thresholding, and we don’t have guarantees of convergence in finite time. In fact, the error function can have many local optima. What we mean by that is that we’re trying to set the weights so that the error is low, but you can get into situations where none of the weights can really change without making the error worse. And you’d like to think, well good, then we’re done, we’ve made the error as low as we can make it. But in fact it could just be stuck in a local optimum: there’s a much better way of setting the weights, it’s just that we have to change more than one weight at a time to get there.>>Oh, so that makes sense. So if we think about the sigmoid and the error function that we picked, right?
The error function was the sum of squared errors, so that looks like a parabola in some high-dimensional space. But once we start combining units with others like this, over and over again, then we have an error space where there may be lots of places that look low, but only look low if you’re standing there, and globally would not be the lowest point.>>Right, exactly right. So you can get these situations: in just the one-unit version, the error function, as you said, is this nice little parabola, and you can move down the gradient, and when you get down to the bottom you’re done. But now, when we start throwing these networks of units together, we can get an error surface that, just in its cartoon form, looks crazy like this. It’s smooth, but there are these places where it goes down, comes up again, goes down maybe further, comes up again and doesn’t come down as far, and you could easily get yourself stuck at a point like this where you’re not at the global minimum. You’re at some local optimum.
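The cartoon can be imitated with a one-dimensional stand-in. The sketch below is not a real network error surface: the function `error_cartoon` is a made-up smooth curve with two dips, chosen only so that plain gradient descent ends up in a different place depending on where it starts. The starting points, learning rate, and step count are likewise assumptions for illustration.

```python
def error_cartoon(w):
    """A made-up smooth curve with one shallow dip and one deeper dip."""
    return w**4 - 3.0 * w**2 + w

def error_slope(w):
    """Derivative of error_cartoon with respect to w."""
    return 4.0 * w**3 - 6.0 * w + 1.0

def descend(w, learning_rate=0.01, steps=2000):
    """Plain gradient descent: keep stepping downhill from the starting weight."""
    for _ in range(steps):
        w -= learning_rate * error_slope(w)
    return w

w_from_right = descend(2.0)   # starts on the right-hand slope
w_from_left = descend(-2.0)   # starts on the left-hand slope

print("from the right:", w_from_right, error_cartoon(w_from_right))
print("from the left: ", w_from_left, error_cartoon(w_from_left))
```

Both runs stop where the slope is essentially zero, so each one looks finished if you are standing there. Only the run from the left reaches the deeper dip. The run from the right is stuck in the shallower one, and no small move from that point makes the error better.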