At the end of this lesson, students are expected to:
<aside> ⚠️ Math-heavy disclaimer: this module contains a lot of derivations. What we want you to take away from this module is the ability to apply backpropagation to update network weights. We have shown every step of how to derive the update equations. You should be able to read and follow along with the derivation, but you won't be expected to derive it yourself. You should know all the numbered equations, understand the update equations, and be able to apply them.
</aside>
In this lesson, we will discuss how to train a neural network. In 4.2 Neural Network Basics we learned about how expressive neural networks can be, but we also learned that they can have a lot of parameters. Really, a lot: for example, this research paper designs a 1.6-trillion-parameter network! How does one go about training a model with so many parameters?
As we saw in ML & Optimization, there are a number of alternatives to gradient-based optimization. For example, we used SMO (Sequential Minimal Optimization, which solves the SVM's quadratic program by repeatedly optimizing the smallest possible subproblems) to train SVMs. Evolutionary algorithms are a whole other family of techniques we haven't covered in this course. While these optimization techniques can get excellent results for certain types of problems, they do not scale well with the "size" of the problem: when the number of parameters they need to find grows too large, they no longer perform well.
In fact, naively applying gradient-based methods would lead us to the same problem. If we tried to train a neural network with tens or hundreds of millions of parameters by computing the gradient with respect to every one of those parameters individually, it would be extremely inefficient: each parameter's derivative would redo much of the same intermediate work, and the total cost would grow with the number of parameters.
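To make that concrete, here is a minimal sketch of the naive approach: estimating every partial derivative with a finite difference, which needs one extra forward pass per parameter. The layer sizes and the `loss` function here are illustrative assumptions, not anything from the lesson, but the scaling problem they show is real.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single fully connected layer with a squared-error loss (illustrative only).
W = rng.normal(size=(100, 100))   # 10,000 parameters in just one small layer
x = rng.normal(size=100)
y = rng.normal(size=100)

def loss(W):
    return 0.5 * np.sum((W @ x - y) ** 2)

# Naive gradient: perturb one parameter at a time and re-run the forward pass.
# That is 10,000 forward passes for this tiny layer alone -- hopeless at the
# scale of millions (let alone trillions) of weights.
eps = 1e-6
base = loss(W)
grad = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_perturbed = W.copy()
        W_perturbed[i, j] += eps
        grad[i, j] = (loss(W_perturbed) - base) / eps
```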
Nevertheless, SGD is the algorithm of choice for optimizing neural networks. But to apply it, we need a more efficient way to compute the gradients. That is where backpropagation comes in. Backpropagation computes the gradient of the loss function w.r.t. each weight using the chain rule, one layer at a time: it starts with the gradient at the last layer and then iterates backwards towards the front of the network, reusing intermediate results to avoid redundant calculations.
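As a preview of where the derivations in this module are headed, here is a minimal sketch of backpropagation followed by a single SGD step, for a toy two-layer network. The architecture, sigmoid activation, squared-error loss, and learning rate are assumptions chosen for illustration, not the specific network derived later in the module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: x -> sigmoid(W1 @ x) -> W2 @ h -> prediction
W1 = rng.normal(scale=0.1, size=(32, 10))
W2 = rng.normal(scale=0.1, size=(1, 32))
x = rng.normal(size=(10, 1))
y = np.array([[1.0]])
lr = 0.1  # learning rate (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: cache the intermediate values the backward pass will reuse.
z1 = W1 @ x
h = sigmoid(z1)
y_hat = W2 @ h
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: start from the loss at the last layer and apply the chain
# rule one layer at a time, passing each upstream gradient back.
d_yhat = y_hat - y          # dL/dy_hat
dW2 = d_yhat @ h.T          # dL/dW2
d_h = W2.T @ d_yhat         # dL/dh, handed back to the first layer
d_z1 = d_h * h * (1 - h)    # dL/dz1, using sigmoid'(z1) = h * (1 - h)
dW1 = d_z1 @ x.T            # dL/dW1

# One SGD step on both weight matrices.
W2 -= lr * dW2
W1 -= lr * dW1
```

Notice that a single forward and backward sweep produces the gradient for every weight at once; the backward pass reuses `h` and `d_yhat` rather than recomputing them per parameter, which is exactly the efficiency gain over the naive approach above.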