Chapter 4
Fast.ai Fastbook Chapter 4: MNIST basics
Let's look under the hood and see exactly what is going on with a relatively simple deep learning model.
How are images represented in a computer?
“Unfortunately” we cannot just feed an image to a computer as-is. To illustrate this, let's take as an example a model that can classify an image as either the number 3 or the number 7, using MNIST as data. All images are grayscale (to make it easier).
While images are usually rendered as colored pixels, under the hood it is all just numbers. In essence, an image is just an array of numbers between 0 and 255 indicating how black a pixel is. 0 is white, 255 is the darkest black.
Every (x, y) combination represents a single pixel. In this case the images are 28x28, which is tiny by modern standards, but that makes it easy to learn with.
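To see this for yourself, you can open one of the MNIST images and turn it into an array of numbers (the file path below is just a hypothetical example):
from PIL import Image
import numpy as np

img = Image.open('mnist_sample/train/3/12.png')  # hypothetical path to a single "3"
arr = np.array(img)
print(arr.shape)             # (28, 28): one value per pixel
print(arr.min(), arr.max())  # values between 0 (white) and 255 (black)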
So how can we create a model that distinguishes between 3's and 7's? Let's start with a simple baseline model, and then slowly introduce features until we have a full-blown neural network.
Baseline: Pixel similarity
So before we start fiddling with neural networks, let’s start with a really naive implementation.
With our entire collection of known 3’s (and 7s), we determine the average grayscale value for each pixel, defining the “ideal” 3 and 7. Then, for any given image, we check if it is closer to the ideal 3 or to the ideal 7.
Given that we have loaded our data into a stacked tensor with dimensions [#, 28, 28], averaging is as simple as
mean3 = list_of_threes.mean(0)
The 0 indicates that we want to average over the zeroth dimension (which represents the number of images). Then you can do the same for the sevens:
mean7 = list_of_sevens.mean(0)
These “optimal” numbers look like the following:


Very blurry, which makes sense, since this image is basically every image of that number pasted on top of each other.
So now we need to think about how we are gonna determine the distance between any image and our “ideal” images. Let's start with 2 simple measures:
- Mean Absolute Difference (L1 norm):
(image - mean3).abs().mean()
- Root Mean Squared Error (L2 norm):
((image - mean3)**2).mean().sqrt()
The difference between the two is that, because of the squaring, bigger mistakes are penalized more than smaller mistakes. Ok, so how good is our baseline model? Recall that to know this as humans, we need to define a metric, and to calculate this metric using a validation set, NOT data the model has already seen. Let's skip the creation of a validation set; just know it contains 2000+ images of 3's and 7's.
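A tiny made-up example shows the effect of the squaring:
import torch

errors = torch.tensor([1.0, 1.0, 1.0, 9.0])  # three small mistakes, one big one
l1 = errors.abs().mean()        # 3.0: the big mistake counts 9x a small one
l2 = (errors**2).mean().sqrt()  # ~4.58: the big mistake dominates even more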
Let's take the MAE and build it into a function that can operate on whole tensors of images:
def mnist_distance(a, b):
    return (a - b).abs().mean((-1, -2))
So a couple of questions. Firstly, how is this able to calculate the distance for each entry in the 2000+ list of validation images? Do we have to write a loop?
Nope. This uses what PyTorch calls broadcasting. Whenever a simple mathematical operation is done on 2 (or more) tensors, PyTorch automatically expands the smaller tensor to be the same size as the bigger one. And when both tensors are the same size, it just performs the given operation on each pair of elements. So
[1000, 28, 28] - [28, 28] returns a [1000, 28, 28] tensor containing the differences for each pixel in each image. It then calls abs() on each individual pixel result, giving us 1000 matrices of 28x28 absolute values.
Alright, so far so good. But why do we have (-1, -2) inside the call to mean?
The (-1, -2) represents a tuple of axes. In Python, -1 refers to the last element, and -2 to the second-to-last. So in this case it tells PyTorch we want to average over the last two axes of the tensor (i.e. the horizontal and vertical dimensions of the image). This makes the final output of the mnist_distance function a [1000] tensor: a single value per image representing the average absolute difference, over all the pixels, between that image and the ideal one.
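A quick shape check (with random stand-in tensors rather than the real data) makes the broadcasting behaviour visible:
import torch

valid_imgs = torch.rand(1000, 28, 28)  # stand-in for the validation images
ideal_3 = torch.rand(28, 28)           # stand-in for mean3
dists = (valid_imgs - ideal_3).abs().mean((-1, -2))  # broadcast, then average the last two axes
print(dists.shape)                     # torch.Size([1000]): one distance per image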
Having this we can define a function that tells us if a given image is a 3 or not:
def is_3(x): return mnist_distance(x, mean3) < mnist_distance(x, mean7)
Thanks to the previously mentioned broadcasting, we can stuff our entire [1000, 28, 28] validation images tensor into it and receive a [True, True, False, ...] tensor of 1000 elements. These booleans can be cast to floats (True being 1, False being 0) and we can calculate the average accuracy:
acc = is_3(validation).float().mean()
On the validation set accompanying this chapter we already get an accuracy of 95.11%. But there is no way to improve this system, and it only differentiates between 3's and 7's. What if we find a whole bunch of images of which we want to know if they contain 8's? Let's see if we can introduce some training.
Stochastic Gradient Descent (SGD)
So, we briefly mentioned stochastic gradient descent before, but let's now dive a bit deeper into how it is actually implemented, using a relatively simple example. In the pixel similarity approach earlier, we did not really have any weights that we could assign values to and improve over iterations. So let's switch up the approach a bit. Instead of trying to find similarity over the entire image, what if we looked at each individual pixel and came up with a set of weights for each one, such that the highest weights are associated with that pixel being black for a particular number? Think of it like this: given the numbers 7 and 8, pixels in the lower left and right corners are generally black in an 8, while they are white (not used) in a 7. If we give those pixels high weight values for an 8 and low weight values for a 7, we can differentiate between the two, while also letting the weights “learn” from each example.
This can be represented by the following function:
probability_8 = (X * W).sum()
where we assume that X is the vector of the image, and W is the vector of the weights. So basically, all we need to do now is find the values of the vector W that result in a high probability for images that are actual 8's, while giving a low probability for images that are not. But how do we turn this relatively simple function into a neural network? Funny thing is, there are 7 steps that are at the basis of ANY machine learning problem (yes, also neural networks). We will specialize them a bit for this example, but in general this is all there is to it:
- Initialize the weights
- For each image, use these weights to predict whether the image is a 3 or a 7
- Based on these predictions, calculate how good the model is (its loss)
- Calculate the gradient, which measures for each weight, how changing that weight would change the loss
- Step (change) all the weights based on the gradient calculation
- Go back to step 2
- Iterate until you no longer want to.
That is it.
Alright, let's dive deeper. Step 1 is easy. While you could initialize the weights to values that make sense, starting with random values is perfectly fine; they get updated anyway. Step 2 is just feeding the model an image and seeing what its output is. Step 3 we already talked about quite a bit in the previous example: calculating how good the model is using a defined loss function. We are now gonna dive into Step 4, calculating the gradients.
Gradients (or the only thing that was really useful in high school math)
So, you remember from high school that gradients are nothing more than the slope of a particular function at a certain point, and to get them you calculate the derivative. Combine this with the fact that for any given loss function we want to minimize it, and you can see how calculating the slope at a given point tells us which direction we need to move. But there are some tricky parts to it (which really are not that tricky once you get to know them):
- But we do not know the exact loss function? How can we calculate the derivative of a loss function we do not know?
Good question. We do not know the actual loss function; all we know is a given x and a resulting y (in 2D space for simplicity's sake). But if we slightly change the x we get a different y, and by comparing the two y outputs we can still calculate the slope, even without knowing the actual function behind it.
- But we have a metric fuckton of weights?
Correct. Even this simple example has close to 1000 weights. To calculate the gradient for a single weight, we change that weight slightly while keeping all other weights constant. Sounds like a perfect job for a GPU.
- Do we have to do this all by hand?
Thank god no, why do you think we use libraries in the first place?
So how do we do it? Within PyTorch it is as simple as:
xt = tensor([3., 4., 10.]).requires_grad_()
requires_grad_() tells PyTorch that we want to calculate gradients with respect to these values. It's like a tag, telling PyTorch “Hey, whenever we tell you to, calculate the gradients for all these values”. Calculating the gradients of all weights is usually called backpropagation, as you move backwards through the model, starting at the outputs and working towards the first layer.
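As a minimal, self-contained example (the squared-sum function here is made up purely to have something to differentiate):
import torch

xt = torch.tensor([3., 4., 10.]).requires_grad_()
yt = (xt ** 2).sum()   # forward pass through some function of xt
yt.backward()          # backpropagation: computes d(yt)/d(xt)
print(xt.grad)         # tensor([ 6.,  8., 20.]), i.e. 2 * xt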
Alright, so these gradients tell us the slope of the function, but they do not tell us directly how far to adjust the parameters. They do give us some indication: a large slope usually means we are far off from the minimum, while a smaller slope indicates that we are already almost flat. However, the weights are almost never changed by the full amount that the slope indicates. The gradient is usually multiplied by a relatively small number (0.001 - 0.1) called the learning rate. This is done so you do not change the weights too quickly, as that can lead to you never finding a minimum. Rather spend some extra epochs finetuning your weights than miss the minimum completely:

An end-to-end example of SGD
Let's assume we measured the speed of a rollercoaster. Measuring is never perfect, so the measurements end up looking like this:
We humans can of course immediately see what's up, but let's train a model that reflects the true speed.
Let's also assume it's a quadratic function:
def f(t, params):
    a, b, c = params
    return a * (t**2) + (b * t) + c
We separate the input from the parameters, since the parameters are the values we are trying to estimate, the input is just the input. So we need to find a, b, c that best reflect our measurements. Let’s define a loss function:
def mse(preds, target):
    return ((preds - target)**2).mean()
We have seen this kind of loss function before. Let's work through the 7 basic steps:
Initialize the Parameters
params = torch.randn(3).requires_grad_()
Calculate predictions
preds = f(time, params)
Calculate the loss
loss = mse(preds, speed)
Calculate the gradients
loss.backward()
Step the weights
lr = 1e-5
params.data -= lr * params.grad.data
params.grad = None
Repeat
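Wrapped into a single loop, a minimal sketch (assuming time and speed are the measurement tensors, and f and mse are defined as above) could look like this:
def apply_step(params, lr=1e-5):
    preds = f(time, params)               # step 2: predict
    loss = mse(preds, speed)              # step 3: measure the loss
    loss.backward()                       # step 4: calculate the gradients
    params.data -= lr * params.grad.data  # step 5: step the weights
    params.grad = None                    # reset the gradients for the next iteration
    return loss.item()

params = torch.randn(3).requires_grad_()  # step 1: initialize
for _ in range(10):                       # steps 6/7: repeat
    print(apply_step(params))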
Doing this 10 times results in the following losses:
5435.53662109375
1577.4495849609375
847.3780517578125
709.22265625
683.0757446289062
678.12451171875
677.1839599609375
677.0025024414062
676.96435546875
676.9537353515625
Keep in mind, the goal is not to get the loss to 0. The goal is to get the loss as low as is deemed appropriate.
Applying SGD to a less toy-ish example
Let's apply the above to the MNIST 28x28 images, since we were still trying to see if we could get a better result than the baseline pixel similarity solution. In that example we already did some data-wrangling on all the images, so we are just gonna continue from there. We are gonna slightly adjust the lists, since we are gonna approximate the following function:
y = w * x + b
We need the list to be a rank-2 tensor (so instead of an [x, 28, 28] tensor we want an [x, 784] tensor) where each image is a rank-1 tensor of 784 values.
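Concretely, re-using the list_of_threes and list_of_sevens tensors from the baseline section (and treating "3" as the positive class), that reshaping could look like this:
train_x = torch.cat([list_of_threes, list_of_sevens]).view(-1, 28*28)
train_y = torch.tensor([1] * len(list_of_threes) + [0] * len(list_of_sevens)).unsqueeze(1)
print(train_x.shape, train_y.shape)  # [N, 784] and [N, 1], one row per image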
So let's initialize the weights to random values (as a [784, 1] column, matching the matrix multiplication we will do below):
weights = init_params((28*28, 1))
and let's initialize the bias as well:
bias = init_params(1)
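init_params is not spelled out here; a minimal version along the lines of what the fastbook uses could be:
def init_params(size, std=1.0):
    # random values scaled by std, tagged for gradient tracking
    return (torch.randn(size) * std).requires_grad_()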
Let's calculate the prediction for a single image:
(train_x[0] * weights.T).sum() + bias
The T indicates that it's the transpose of the weights, but we don't have to care much about that, since we are never doing this by hand again.
So how do we calculate the predictions for every image? A loop? Fuck no. We use matrix multiplication. Since our weights are a [784, 1] matrix and our images are a [1000, 784] matrix, we can multiply the two together to get a prediction for each of the 1000 “rows” in the images tensor.
def linear(xb): return xb@weights + bias
This gives us a [1000, 1] tensor of predictions looking like:
[[20.2336],
[17.0644],
[15.2384],
...,
[18.3804],
[23.8567],
[28.6816]]
Alright, so now we have our predictions, it is time to pick / create a loss function. So which one do we take? Accuracy is not an option. Since this is binary classification (either a 3 or a 7), if we change a weight very little, there is no change in the output. The output only flips once, at the 50% mark. Above and below that, change is non-existent.
One way to remedy this is to scale all our predictions to be between 0 and 1, and with a smooth curve connecting the two. That way a small change in weights would be reflected by the prediction being slightly more in the direction of either 0 or 1. For that we can transform the predictions using the sigmoid function:
def sigmoid(x): return 1 / (1 + torch.exp(-x))

Using this, we can define a loss function that scales all the predictions to be between 0 and 1, and allows us to accurately portray how confident a certain prediction is:
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 - predictions, predictions).mean()
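As a quick sanity check with made-up numbers: a confident, correct prediction should give a small loss, and the same prediction with flipped targets a large one:
preds = torch.tensor([3.0, -3.0])        # raw model outputs (before sigmoid), made up
mnist_loss(preds, torch.tensor([1, 0]))  # both point the right way -> ~0.05
mnist_loss(preds, torch.tensor([0, 1]))  # same outputs, wrong targets -> ~0.95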
Now that we have a loss function, it is time for the next step: changing or updating the weights based on the gradients. This is called optimization.
Up until now we have just been calculating the loss per item, giving us a long list of numbers. But how do we go from that list to updating the actual set of weights, since those are reflective of the entire model? We could calculate the loss for every item in the entire dataset and average that value. That's fine if our dataset contains 1000 items, but with datasets containing millions of inputs that will take a very long time. So we need to find some middle ground that is both fast enough and accurate enough. That compromise is to calculate the average loss for “a few” data items at a time. This is called a mini-batch. The number of data items in a mini-batch is called the batch size. A larger batch size means more accurate estimates of the gradients from the loss function, at the cost of increased time per step. Choosing a proper batch size can therefore have quite an impact on the performance of a model.

Another smart thing (connected to mini-batches and done by default by many frameworks) is to shuffle the data into different batches at the start of each epoch. That way you ensure that the model does not see the same batches with the same content every single epoch, leading to better generalization by your model. In PyTorch and fast.ai this is done using the DataLoader object, which takes a dataset as input (a dataset is a list of tuples containing inputs and targets of the model). So given a dataset:
(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]
inserting it into a DataLoader:
dl = DataLoader(ds, batch_size = 6, shuffle = True)
returns:
[(tensor([17, 18, 10, 22, 8, 14]), ('r', 's', 'k', 'w', 'i', 'o')),
(tensor([20, 15, 9, 13, 21, 12]), ('u', 'p', 'j', 'n', 'v', 'm')),
(tensor([ 7, 25, 6, 5, 11, 23]), ('h', 'z', 'g', 'f', 'l', 'x')),
(tensor([ 1, 3, 0, 24, 19, 16]), ('b', 'd', 'a', 'y', 't', 'q')),
(tensor([2, 4]), ('c', 'e'))]
Putting it all together
So now we have everything we need to write a manual training loop. Keep in mind that in the actual libraries a lot of this is abstracted and optimized, but it's good to see how it would be written by hand at least once:
In pseudocode-ish
for x, y in dl:
    pred = model(x)
    loss = loss_func(pred, y)
    loss.backward()
    parameters -= parameters.grad * lr
    parameters.grad = None
First we initialize the weights and the bias:
weights = init_params((28*28, 1))
bias = init_params(1)
Then we create a dataloader from the dataset for both the training and the validation set:
dl = DataLoader(dset, batch_size = 256)
valid_dl = DataLoader(valid_dset, batch_size = 256)
Let's create a function to calculate the gradients:
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()
And create a function to train the model for 1 epoch (1 pass over all the training data):
def train_epoch(model, lr, params):
    for xb, yb in dl:  # so for each batch
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad * lr
            p.grad.zero_()
We do the same to allow us to calculate the accuracy, so we humans know what's up with the model (again per batch):
def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds > 0.5) == yb
    return correct.float().mean()
and make a function to calculate the accuracy over an entire epoch:
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
And let's train our model for 20 epochs:
lr = 1.
params = weights, bias
for i in range(20):
    train_epoch(linear, lr, params)  # remember our linear function from way back
    print(validate_epoch(linear), end=' ')
0.8314 0.9017 0.9227 0.9349 0.9438 0.9501 0.9535 0.9564 0.9594 0.9618 0.9613 0.9638 0.9643 0.9652 0.9662 0.9677 0.9687 0.9691 0.9691 0.9696
We are already at the same level as our pixel similarity approach, but this model can be improved in many different ways, and is immensely flexible.
So now that we have done it all relatively manually, let's go stepwise through the abstractions until we end up on the level that you would normally use.
First we replace the initialization of the parameters and the model with a PyTorch module. So instead of
weights = init_params((28*28, 1))
bias = init_params(1)
And
def linear(xb): return xb@weights + bias
We can use:
linear_model = nn.Linear(28*28, 1)
Secondly, we can create an optimizer using the above, which takes care of the stepping and zeroing of the gradients. Fast.ai provides this in the SGD class (let's skip building this by hand). The above two things now turn our training loop into:
linear_model = nn.Linear(28*28, 1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)
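train_model (and the optimizer-based version of train_epoch it relies on) is not shown here; a minimal sketch re-using calc_grad, validate_epoch and dl from above could be:
def train_epoch(model):
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        opt.step()       # the optimizer steps every parameter it was given
        opt.zero_grad()  # and resets the gradients afterwards

def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')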
Fast.ai also provides Learner.fit, which abstracts the whole train_model function away. It requires us to create a DataLoaders object, which is basically the training and the validation DataLoader wrapped together:
dls = DataLoaders(dl, valid_dl)
which gives us the ability to turn our training loop into:
learn = Learner(dls, nn.Linear(28*28, 1), opt_func = SGD,
loss_func = mnist_loss, metrics = batch_accuracy)
learn.fit(10, lr = lr)
So why go through all this trouble of abstractions and stuff? Because with all these abstraction layers, we can turn our linear model into a full-blown neural network!
Adding a non-linearity (or how to make insanely simple things sound insanely complicated)
Alright, so we could just stack multiple linear layers together and call it a day, like:
nn_model = nn.Sequential(
nn.Linear(28*28, 30),
nn.Linear(30, 12),
nn.Linear(12, 1))
But if you have had a single high school math lesson, you can spot the flaw with this: adding three linear things together is the same as having a single linear thing with different parameters. This is just a waste of resources.
Mathematically, the composition of two linear functions is another linear function. So we can stack as many linear classifiers as we want on top of each other, but without nonlinear functions between them it will just be the same as one linear classifier. Things change if we put something non-linear in between each linear layer, such as a Rectified Linear Unit (ReLU): the linearity problem goes away, each linear layer becomes somewhat decoupled from the others, and each can do its own useful work.
So this ReLU must be super complicated? Wrong! Here is a ReLU mixed with 2 linear layers:
simple_net = nn.Sequential(
nn.Linear(28*28, 30),
nn.ReLU(),
nn.Linear(30, 1)
)
This ReLU (and there are different versions of ReLUs, but they are all easier than they sound) basically says: replace every negative number with 0.
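In code that is literally all it is:
import torch
import torch.nn.functional as F

F.relu(torch.tensor([-2.0, 0.5, 3.0]))  # tensor([0.0000, 0.5000, 3.0000])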
And this simple net can once again be used to create a learner and to train a model:
learn = Learner(dls, simple_net, opt_func = SGD,
loss_func = mnist_loss, metrics = batch_accuracy)
learn.fit(40, 0.1)
So now we have an infinitely flexible function that we can pretty much use for everything.
Now for the final deep dive
So in essence a single non-linearity with 2 linear layers can approximate ANY function. So why go deeper? Performance. With the amount of computing power we have at our disposal nowadays, a deeper model allows us to use fewer parameters and smaller layers, resulting in a quicker learning loop. Remember that everything we did by hand got roughly 95% accuracy. Let's now train an 18-layer model on the same data:
dls = ImageDataLoaders.from_folder(path)
learn = vision_learner(dls, resnet18, pretrained = False,
loss_func = F.cross_entropy, metrics = accuracy)
learn.fit(1, 0.1)
This results (after 1 epoch) in an accuracy of 99.7%.
Questionnaire
- How is a grayscale image represented on a computer? How about a color image? A. Grayscale images are represented by matrices of numbers, one cell for each pixel. The value ranges from 0 (white) to 255 (black). A color image is also a matrix with one cell per pixel, but instead of a single 0-255 value, each cell consists of 3 numbers (0-255) indicating the amount of red, green, and blue.
- How are the files and folders in the MNIST_SAMPLE dataset structured? And why? A. They are divided into a training and a validation set, with a directory per digit filled with images of that digit. This is pretty much the standard way of dividing things, since the developer in question does not have to think about how to divide images up.
- Explain how the “pixel similarity” approach to classifying digits works. A. The pixel similarity approach works by creating the average number out of all the images of that particular number. This average number can then be compared with an image to see how much alike they are. Although a naive implementation, it is not at all a bad one, getting over 90% accuracy in differentiating between 3's and 7's. However, it would become complicated if we wanted to differentiate between multiple categories, and there is no room for improvement.
- What is a list comprehension? Create one that selects odd numbers from a list and doubles them.
A. Python list comprehensions are basically syntactic sugar to loop over a list and perform an action on each element, resulting in a new list. Think of it as a map function:
[x * 2 for x in list if x % 2 != 0]
- What is a rank-3 tensor? A. A rank-3 tensor is a tensor with 3 ranks, or 3 dimensions. A tensor's rank is the number of axes in the tensor. The shape is the length of each axis.
- What is the difference between tensor rank and shape? How do you get the rank from the shape?
A. The length of a tensor's shape is its rank. Example: a tensor's shape can be [6131, 28, 28], indicating 6k+ images of 28x28 pixels. The fact that there are 3 values in its shape makes it a rank-3 tensor.
- What are RMSE and L1 norm?
A. They can both be used as loss functions. RMSE is the Root Mean Squared Error: you take the difference between target and prediction, square it, average it, and take the square root. The L1 norm is the Mean Absolute Difference, where you take the absolute value of the difference and average it.
- How can you apply a calculation on thousands of numbers at once, many thousands of times faster than a Python loop? A. A combination of raw GPU power, broadcasting, hyper-efficient C code, and matrix multiplication.
- Create a 3x3 tensor or array containing the numbers 1 to 9. Double it, select the bottom-right four numbers.
A.
t = tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); t = t * 2; t[1:3, 1:3]
- What is broadcasting? A. Broadcasting is a technique that enables tensor mathematics between tensors of different shapes. It automatically expands the smaller tensor to be the same size as the one with the larger rank, which in turn allows very fast mathematical operations.
- Are metrics generally calculated using the training set, or the validation set? Why? A. Metrics are generally calculated using a validation set. Calculating how good a model is performing using data it has already seen has no use, since the model encompasses that data already. You want to know how good a model is with data it has never seen before.
- What is SGD? A. SGD stands for Stochastic Gradient Descent and it's the main way in which neural networks “learn”. By calculating gradients for the weights after a model prediction, we can shift the model in such a way that it becomes slightly better with each iteration.
- Why does SGD use mini-batches? A. The reason SGD uses mini-batches is a trade-off between speed and accuracy. You can use the entire dataset with each iteration to calculate the gradients, but that is an insane amount of calculations, which can slow the process down immensely. On the other end, using only a single image at a time gives little info. The compromise is to meet somewhere in the middle.
- What are the seven steps in SGD for machine learning?
- Initialize the weights
- For each image, use these weights to predict whether the image is a 3 or a 7
- Based on these predictions, calculate how good the model is (its loss)
- Calculate the gradient, which measures for each weight, how changing that weight would change the loss
- Step (change) all the weights based on the gradient calculation
- Go back to step 2
- Iterate until you no longer want to.
- How do we initialize the weights in a model? A. We give them random values. We can opt to give them somewhat reasonable values, but they get updated to shift into the correct direction anyway, so random is also perfectly fine.
- What is loss? A. Loss is a measure of how good the model is in its predictions. It is similar to the accuracy of a model, but the loss is usually more optimizable by machines, and less readable than the accuracy.
- Why can’t we always use a high learning rate?
A. The learning rate is used to scale the changes to the values of the weights. If we use too high a learning rate, we run the risk of overshooting our estimate of the minimum of the loss function, causing us to never find it, see the following image:

- What is a “gradient”? A. The gradient is the value of the slope of the loss function at a specific point (the value of the weight you are calculating the gradient of).
- Do you need to know how to calculate gradients yourself? A. Not really, but it does not hurt to know how it is done on a superficial level.
- Why can't we use accuracy as a loss function? A. Already explained above, but in general: accuracy is used as a human-readable way to express the performance of a model, and it is hard to optimize by SGD, as it gives a very “black-and-white” view, with little room for small changes to be visible.
- Draw the sigmoid function. What is special about its shape? A. The sigmoid function is always between 0 and 1 for any given input. It is used to scale NN predictions in such a way that they can be meaningfully used for validation.
- What is the difference between a loss function and a metric? A. The loss function is something that we strive to minimize. Metrics are just “helpers” to let us check how good of a job we are actually doing at reaching that goal
- What is the function to calculate new weights using the learning rate?
A.
weights -= gradient * learning_rate
- What does the DataLoader class do? A. A DataLoader gives us mini-batches of data whenever we give it our entire dataset.
- Write pseudo-code showing the basic steps taken in each epoch of SGD.
A. Initialize the parameters: params = torch.randn(3).requires_grad_(). Calculate predictions: preds = f(time, params). Calculate the loss: loss = mse(preds, speed). Calculate the gradients: loss.backward(). Step the weights: params.data -= lr * params.grad.data; params.grad = None. Repeat.
- Create a function that, if passed two arguments [1,2,3,4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure? A. No need to write it, this is basically the zip function. It is special because this is the way PyTorch expects our data to be structured (input, target) in order to be used properly in machine learning.
- What does view do in PyTorch? A. It returns a new tensor with the same data as the original tensor but with a different shape.
- What are the bias parameters in a neural network? Why do we need them? A. Because they are part of the definition of a linear function: y = ax + b.
- What does the @ operator do in Python? A. It does matrix multiplication, very efficiently.
- What does the backward method do? A. It calculates the gradients for all the weights that gradients should be calculated for.
- Why do we have to zero the gradients?
A. The gradients have to be zeroed because when you call .backward() it actually adds the gradient of the loss to any existing gradients.
- What information do we have to pass to the Learner? A. We pass the DataLoaders, which ensure our datasets get batched up, and the model to make predictions with. We add the optimization function, which often is just SGD. Finally we add the loss function and the metrics.
- Show Python or pseudo-code for the basic steps of the training loop.
A.
for x, y in dl:
    pred = model(x)
    loss = loss_func(pred, y)
    loss.backward()
    parameters -= parameters.grad * lr
- What is a ReLU? Draw a plot. A. ReLU stands for Rectified Linear Unit; it basically means replace all negative numbers with 0. The plot is trivial: it's 0 for all negative numbers, and then y = x.
- What is an Activation Function? A. The activation function in a model is the non-linearity. In simple cases that is the ReLU.
- What is the difference between F.relu and nn.ReLU? A. Mathematically there is no difference, but nn.ReLU is a PyTorch module, which makes it usable in the main PyTorch neural network flow.
- The Universal Approximation Theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more? A. Using multiple nonlinearities allows the use of fewer parameters to model the same complexity, leading to better performance. This increase in performance can then also be spent modelling higher complexity.