At the end of this lesson, students are expected to:
<aside> ⚠️ Math-heavy disclaimer: this module contains a lot of derivations. What we want you to take away from this module is the ability to apply backpropagation to update network weights. We have shown every step of how to derive the update equations. You should be able to read and follow along with the derivation, but you won't be expected to derive it yourself. You should know all the numbered equations, understand the update equations, and be able to apply them.
</aside>
In this lesson, we will discuss how to train a neural network. In 4.2 Neural Network Basics we learned about how expressive neural networks can be, but we also learned that they can have a lot of parameters. Really, a lot: for example, this research paper designs a 1.6-trillion-parameter network! How does one go about training a model with so many parameters?
As we saw in ML & Optimization, there are a number of alternatives to gradient-based optimization. For example, we used SMO (Sequential Minimal Optimization, which solves the SVM's quadratic program by repeatedly optimizing the smallest possible subproblems) to train SVMs. Evolutionary algorithms are a whole other family of techniques we haven't covered in this course. While these optimization techniques can get excellent results for certain types of problems, they do not scale well with the "size" of the problem: when the number of parameters they need to find grows too large, they no longer perform well.
In fact, naively applying gradient-based methods would lead us to the same problem. If we tried to train a neural network with tens or hundreds of millions of parameters by computing the gradient with respect to every one of those parameters individually, it would be extremely inefficient: each parameter's derivative would redo much of the same intermediate work, and the total cost would grow with the number of parameters.
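To make that concrete, here is a minimal sketch of the naive approach: estimating every partial derivative with a finite difference, which needs one extra forward pass per parameter. The layer sizes and the `loss` function here are illustrative assumptions, not anything from the lesson, but the scaling problem they show is real.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single fully connected layer with a squared-error loss (illustrative only).
W = rng.normal(size=(100, 100))   # 10,000 parameters in just one small layer
x = rng.normal(size=100)
y = rng.normal(size=100)

def loss(W):
    return 0.5 * np.sum((W @ x - y) ** 2)

# Naive gradient: perturb one parameter at a time and re-run the forward pass.
# That is 10,000 forward passes for this tiny layer alone -- hopeless at the
# scale of millions (let alone trillions) of weights.
eps = 1e-6
base = loss(W)
grad = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_perturbed = W.copy()
        W_perturbed[i, j] += eps
        grad[i, j] = (loss(W_perturbed) - base) / eps
```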
Nevertheless, SGD is the algorithm of choice for optimizing neural networks. But to apply it, we need a more efficient way to compute the gradients. That is where backpropagation comes in. Backpropagation computes the gradient of the loss function w.r.t. each weight using the chain rule, one layer at a time: it starts with the gradient at the last layer and then iterates backwards towards the front of the network, reusing intermediate results to avoid redundant calculations.
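As a preview of where the derivations in this module are headed, here is a minimal sketch of backpropagation followed by a single SGD step, for a toy two-layer network. The architecture, sigmoid activation, squared-error loss, and learning rate are assumptions chosen for illustration, not the specific network derived later in the module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: x -> sigmoid(W1 @ x) -> W2 @ h -> prediction
W1 = rng.normal(scale=0.1, size=(32, 10))
W2 = rng.normal(scale=0.1, size=(1, 32))
x = rng.normal(size=(10, 1))
y = np.array([[1.0]])
lr = 0.1  # learning rate (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: cache the intermediate values the backward pass will reuse.
z1 = W1 @ x
h = sigmoid(z1)
y_hat = W2 @ h
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: start from the loss at the last layer and apply the chain
# rule one layer at a time, passing each upstream gradient back.
d_yhat = y_hat - y          # dL/dy_hat
dW2 = d_yhat @ h.T          # dL/dW2
d_h = W2.T @ d_yhat         # dL/dh, handed back to the first layer
d_z1 = d_h * h * (1 - h)    # dL/dz1, using sigmoid'(z1) = h * (1 - h)
dW1 = d_z1 @ x.T            # dL/dW1

# One SGD step on both weight matrices.
W2 -= lr * dW2
W1 -= lr * dW1
```

Notice that a single forward and backward sweep produces the gradient for every weight at once; the backward pass reuses `h` and `d_yhat` rather than recomputing them per parameter, which is exactly the efficiency gain over the naive approach above.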