Chapter 5
Fast.ai Fastbook Chapter 5: Deep Dive into Image Classification
Alright, now that we know how to run and create basic models, let's dive a bit deeper into how PyTorch and fast.ai work under the hood. This will span the coming chapters, going from image classification to natural language processing to really low-level architectural improvements.
But let's start relatively simple, by expanding the image classification example from binomial (two choices) to multinomial (any number of choices).
From dogs vs cats to individual pet breeds
In the first chapter we built a simple yet powerful model to classify whether a picture shows a cat or a dog. Let's see if we can build a model that can identify the actual breed of the cat or dog we present it with. We will be skipping the data preparation shown in the book's chapter in this summary. While it is an essential part of building awesome models, it does not translate well to a written summary like this, plus it is something I am already fairly handy at.
So, given that we have extracted all the data and have it in a shape we can work with, we create a DataBlock. Remember, the DataBlock API is a high-level API that helps us get our data into DataLoaders. It can be seen as the blueprint of a pipeline, laying out how to handle the data you are going to give it. An example:
pets = DataBlock(blocks = (ImageBlock, CategoryBlock),    # independent variable: an image; dependent variable: a category (the breed)
                 get_items = get_image_files,             # how to collect the items: all image files under the given path
                 splitter = RandomSplitter(seed = 42),    # random train/validation split, reproducible thanks to the seed
                 get_y = get_breed_from_name(),           # how to label each item (here assumed to extract the breed from the file name)
                 item_tfms = Resize(460),                 # per-item transform (CPU): presize every image to a large size
                 batch_tfms = aug_transforms(size = 224, min_scale = 0.75))  # per-batch transforms (GPU): augment + final resize to 224
dls = pets.dataloaders(path/"images")
Presizing and Batch Transformations
In the previous code examples there were 2 relatively new lines:
item_tfms = Resize(460)
and
batch_tfms = aug_transforms(size = 224, min_scale = 0.75)
These 2 lines implement a fast.ai strategy called presizing. This minimizes data destruction while maintaining good performance.
As you might remember, images all need to be the same size in order to be collated into a single tensor and fed to the model. We also want to minimize the amount of computation we do, since we already do a metric fuckton of it. The challenge, however, is that transforming images after scaling them down to their final size can lead to data degradation (empty zones, weird interpolation artifacts, etc.)
The above image has been transformed after resizing and you can see some glitches in the lower left and right corners.
To combat this, fastai presizing uses two strategies:
- Resize the image to a relatively large dimension, certainly larger than the target training dimensions.
- Compose all of the common augmentation operations performed afterwards (including the resize to the final size) into a single operation, as in the sketch below.
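To make that division of labour concrete, here is the same presizing pair again, annotated with where each step runs (a small sketch; the names come straight from the DataBlock example above):

from fastai.vision.all import *  # provides Resize and aug_transforms

# Per-item transform, applied on the CPU: resize every image to a generously
# large size first, so later augmentations have pixels to spare.
item_tfms = Resize(460)

# Per-batch transforms, applied on the GPU: all common augmentations plus the
# final crop/resize to 224 are composed into a single operation.
batch_tfms = aug_transforms(size = 224, min_scale = 0.75)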

Debugging the pipeline
Now that we have our DataBlock pipeline in place and have fed it some data with the .dataloaders() call, it is always a smart idea to check whether the data ended up where it should have (correct labels, correct image transforms, etc.).
Using dls.show_batch() you can look at a batch of the training data. If you want more information, calling pets.summary() (where pets is a DataBlock object) gives you the output of the entire pipeline as it tries to create a batch for you. It's a lot of output, but it tells you exactly what went wrong and where.
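As a quick sketch (assuming the pets DataBlock and dls from above), those two checks look roughly like this:

# Show a few decoded (image, label) pairs from a batch of the training data
dls.show_batch(nrows = 1, ncols = 3)
# Trace the whole DataBlock pipeline step by step on the source data,
# printing what each stage produces and where it fails (if it fails)
pets.summary(path/"images")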
Once you think everything is alright, the next best thing to do is to train a simple baseline model. Don't wait too long to train your first model! People often try to get everything perfect before training their first model, but that approach leaves you without a baseline result. And often these baseline results can already be very good at what they need to do.
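A minimal baseline might look something like this (a sketch using the dls from above; resnet34 and two epochs are just reasonable defaults, not the chapter's exact run):

# Transfer-learning baseline: pretrained resnet34, a couple of epochs of fine-tuning
learn = vision_learner(dls, resnet34, metrics = error_rate)
learn.fine_tune(2)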
So what is a good loss function for a multinomial result?
For multinomial classification, a good loss function is cross-entropy loss. This is basically an extension of the MNIST loss function from the previous chapter. Let's dig into what this loss function is and how it works.
Viewing activations and labels
In order to understand the loss function, it helps to check what the loss function actually sees in terms of data. We can retrieve one batch from the dataloaders with
x, y = dls.one_batch()
If we print out y, we see it is a tensor of 64 numbers between 0 and 36 (one label per image in the batch; there are 37 pet breeds):
TensorCategory([ 0, 5, 23, 36, 5, 20, 29, 34, 33, 32, 31, 24, 12, 36, 8, 26, 30, 2, 12, 17, 7, 23, 12, 29, 21, 4, 35, 33, 0, 20, 26, 30, 3, 6, 36, 2, 17, 32, 11, 6, 3, 30, 5, 26, 26, 29, 7, 36, 31, 26, 26, 8, 13, 30, 11, 12, 36, 31, 34, 20, 15, 8, 8, 23], device='cuda:5')
Each number represents a pet breed; these are the labels for the batch. But what does the model actually predict? The output of the last activation layer is not a single number like 0, but one prediction per possible outcome. So with 37 pet breeds, the model's predictions contain 37 numbers, which all sum up to 1:
preds, _ = learn.get_preds(dl = [(x, y)])
preds[0]
tensor([9.9911e-01, 5.0433e-05, 3.7515e-07, 8.8590e-07, 8.1794e-05, 1.8991e-05, 9.9280e-06, 5.4656e-07, 6.7920e-06, 2.3486e-04, 3.7872e-04, 2.0796e-05, 4.0443e-07, 1.6933e-07, 2.0502e-07, 3.1354e-08, 9.4115e-08, 2.9782e-06, 2.0243e-07, 8.5262e-08, 1.0900e-07, 1.0175e-07, 4.4780e-09, 1.4285e-07, 1.0718e-07, 8.1411e-07, 3.6618e-07, 4.0950e-07, 3.8525e-08, 2.3660e-07, 5.3747e-08, 2.5448e-07, 6.5860e-08, 8.0937e-05, 2.7464e-07, 5.6760e-07, 1.5462e-08])
As you might have seen, most of the model's predictions are tiny, but the value at index 0 is by far the highest (about 0.999), so its "final prediction" is 0.
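Two quick sanity checks on that prediction vector (a small sketch, assuming the preds from above):

len(preds[0]), preds[0].sum()   # 37 probabilities, one per breed, summing to (approximately) 1
preds[0].argmax()               # index 0: the class with the highest probability, i.e. the "final prediction"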
But we also learned that the raw activations of a model usually do not lie between 0 and 1, and they certainly do not sum to 1. Last time we used the sigmoid function for this.
But this only worked for a binary output. Is there something similar for a multinomial output? The answer is yes, the softmax function. The sigmoid was defined by:
def sigmoid(x): return 1 / (1 + torch.exp(-x))
The softmax is defined by:
def softmax(x): return torch.exp(x) / torch.exp(x).sum()  # for a batch of activations, sum per row instead: torch.exp(x).sum(dim=1, keepdim=True)
The sum() in the softmax means it takes all the other categories into account when calculating each prediction. An example:
We have a tensor of random values
t = tensor([0.02, -2.49, 1.25])
Calling sigmoid on this tensor returns:
tensor([0.5050, 0.0766, 0.7773])
While softmax returns:
tensor([0.2221, 0.0180, 0.7599])
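As a quick check of those two properties, and to show that PyTorch ships the same function, here is a small standalone sketch (plain torch, independent of the fastai imports above):

import torch
import torch.nn.functional as F

t = torch.tensor([0.02, -2.49, 1.25])
print(torch.exp(t) / torch.exp(t).sum())  # our hand-rolled softmax from above
print(F.softmax(t, dim = 0))              # PyTorch's built-in softmax gives the same values
print(F.softmax(t, dim = 0).sum())        # tensor(1.) -- softmax outputs always sum to 1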
The only downside of the softmax (and it's only a downside in niche uses) is that it really wants to pick one class:
softmax(tensor([-1, -0.5]))
returns
tensor([0.3775, 0.6225])
While there is only a relatively small difference between the two activations, after the softmax the second prediction is about 1.65 times as high as the first. For classifiers this is intended behaviour, since we want the model to pick one class. But in cases where "I don't know" is a valid answer, you might not want this kind of behaviour.
Alright, softmax is only one part of the cross-entropy loss function. The second part is the log likelihood.
Log Likelihood
So now we have predictions from the model, all between 0 and 1, nice and fancy. But we still have no way to check whether the model is right or wrong, and how confident it is in its predictions. That's where the (negative) log likelihood comes in. A little table where we go back to our 3s and 7s:

So here we have a table where the first two columns are the activations (for "3" and "7" respectively), the next column is our target, and idx is the row index. The result column is the activation of the target (0 for a 3, 1 for a 7) at the given row index. Having an explicit row index may seem odd in a table, but it makes sense once the activations and targets live in separate tensors that you index into. Just as an example: in the third row the model predicted 13% for a 3 and 87% for a 7, yet the image was of a 3, so the result is 0.13.
The “loss” column is simply the -log of the result. The higher this number is, the closer to 0 the result was. You can see that the largest number (5.69) corresponds to a very confident but very wrong prediction: the model was about 99% sure the image was a 3 while it actually was a 7. Wrong predictions that were less confident (row 5) give a much smaller (yet still high) number. And this is exactly what our loss function should do: be high when we are very wrong and low when we are very right.
If we then take the mean of all these losses, we have the negative log likelihood loss. The combination of softmax first and log likelihood after is called cross-entropy loss.
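As a small sketch of that table in code (the numbers here are made up, not the ones from the book's table): index into the softmax activations with the targets, take the negative log, and average:

import torch

# Hypothetical softmax outputs for four images; the columns are ("3", "7")
sm_acts = torch.tensor([[0.60, 0.40],
                        [0.99, 0.01],
                        [0.13, 0.87],
                        [0.25, 0.75]])
targ = torch.tensor([0, 1, 0, 1])   # the correct class per row: 0 = "3", 1 = "7"
idx = torch.arange(len(targ))

result = sm_acts[idx, targ]         # the activation of the correct class, per row
loss = -result.log()                # high when the correct class got a low probability
loss.mean()                         # the mean is the negative log likelihood loss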
Of course, this is built into PyTorch as nn.CrossEntropyLoss().
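A quick sketch showing they really are the same thing: PyTorch's cross-entropy takes the raw activations (logits) and applies log-softmax plus the negative log likelihood internally (the activations and targets below are made up):

import torch
import torch.nn.functional as F

acts = torch.randn(4, 37)               # hypothetical raw activations: 4 images, 37 breeds
targets = torch.tensor([0, 5, 23, 36])  # hypothetical labels

manual  = F.nll_loss(F.log_softmax(acts, dim = 1), targets)  # softmax (in log space), then NLL
builtin = F.cross_entropy(acts, targets)                     # the same thing in one call
module  = torch.nn.CrossEntropyLoss()(acts, targets)         # the nn.Module version used as a loss function
print(manual, builtin, module)                               # all three print the same value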
Cool, so the computer now has a function it can nicely minimize. But how do we as humans check the output?
Model Interpretation
Alright, so let's see if we as humans can make some sense of the model. We humans understand accuracy, so let's look at that:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

A nice actual vs predicted plot. But since it is so big, I think we are better off asking the model where it made most of its mistakes:
interp.most_confused()
[('american_pit_bull_terrier', 'staffordshire_bull_terrier', 10),
('Ragdoll', 'Birman', 8),
('Siamese', 'Birman', 6),
('Bengal', 'Egyptian_Mau', 5),
('american_pit_bull_terrier', 'american_bulldog', 5)]
These breeds all seem very alike, so those errors are not that weird. Alright, this has given us a nice baseline model. Now we are gonna dive into fine-tuning some parameters, to see if we can improve this baseline.
1. The Learning Rate Finder
One of the most important things we can do when training is to make sure we have the correct learning rate. Our first model (not shown) had an error rate of 8% after 2 epochs with a default learning rate of 1e-3. If we take the exact same model, but up the learning rate to 0.1, we get an error rate of 84%. This means the optimizer overshot the minimum of the loss function, and kept going in the wrong direction.
So how do we find the best learning rate? Guessing? A rule of thumb? While those options might get you in the right direction, something as relatively simple as a learning rate finder will serve you better. The idea of the learning rate finder is extremely simple:
- Start with a really small learning rate, something almost so small it becomes useless, and use that learning rate for 1 mini-batch.
- Find the loss of that mini-batch
- Increase the learning rate by some percentage (double it, whatever)
- Keep doing this until the loss gets worse.
- Pick a learning rate 1 order of magnitude smaller than where the minimum loss was achieved or the last point where the loss was clearly improving.
This can be done with a simple command:
learn = vision_learner(dls, resnet34, metrics = error_rate)
learn.lr_find()
which returns a nice graph:
The graph indicates that somewhere around 1e-3 would be nice. Don't be too hung up on trying to estimate the perfect learning rate from this graph, however. When picking a learning rate, we care about the order of magnitude above all else.
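In recent fastai versions lr_find also returns its suggestion(s), so a common pattern looks roughly like this (a sketch; the exact attribute names depend on your fastai version):

lrs = learn.lr_find()                     # plots the graph and returns suggested learning rates
learn.fine_tune(2, base_lr = lrs.valley)  # use the suggested "valley" point as the base learning rate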
2. Unfreezing and transfer learning
We already discussed briefly how transfer learning works: we take an already pre-trained model and show it our data instead. But what does this really mean? We also know that a CNN consists of many different linear and non-linear layers, ending in a final linear layer followed by an activation function. That final linear layer uses a matrix with as many columns as there are output categories (in the case of classification, at least). This layer is not useful when we want to train a pre-trained model for a new task. But that does not mean the entire model is useless: all the layers prior to the final one have been trained to detect generic (early layers) and more specialized (later layers) image features.
All these learned features are still very useful for our new use case, and it would be a shame to throw them all out. So we toss out only the final layer and replace it with a new one that fits our task. We then train the model for a couple of epochs, but tell it to only update the weights in this newly added layer; keeping the rest of the model fixed like this is called freezing. After this we unfreeze the layers and train the entire model for the number of epochs specified. This is the default behaviour of learn.fine_tune(), which is very reasonable.
If you want more fine-grained control, however, you can have it: learn.fine_tune() uses learn.fit_one_cycle() under the hood, and there is nothing stopping you from calling learn.fit_one_cycle() directly. Use learn.fine_tune?? and learn.fit_one_cycle?? to learn more about how to call them.
Here is part of the source code for learn.fine_tune():
def fine_tune(self:Learner, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100,
              pct_start=0.3, div=5.0, **kwargs):
    "Fine tune with `Learner.freeze` for `freeze_epochs`, then with `Learner.unfreeze` for `epochs`, using discriminative LR."
    self.freeze()
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
    base_lr /= 2
    self.unfreeze()
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
3. Discriminative Learning Rate
Discriminative learning rates are the concept of using different learning rates for different layers, and this is fast.ai's default behaviour. It works by using a lower learning rate for the early layers and a higher learning rate for the later layers (the ones that recognize more complex concepts). The idea is that the early layers learn concepts that are highly applicable to many models, such as edges or gradients. These have already been trained for hundreds of epochs (remember, we are using transfer learning here), so they are not likely to get much better with big steps. Using a smaller learning rate, we can fine-tune these layers more precisely. The reverse is true for the later layers.
So instead of passing a single number as a learning rate to fast.ai, you can pass a slice.
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fit_one_cycle(3, 3e-3)
learn.unfreeze()
learn.fit_one_cycle(12, lr_max=slice(1e-6,1e-4))
Fast.ai will make sure the learning rate range is spread over all the layers of the model equally.
4. The number of epochs
This basically comes down to "how much time do you have?", since time is most often the number one constraint. So at first you should pick a number of epochs that fits in the amount of time you have. After this you check your losses and your metrics. If they are still improving in your final epochs, then you can be sure you have not trained too long.
On the other hand, if you see your loss and metrics actually getting worse after a while, that is a clear sign of overfitting. In practice you mostly care about your metrics, since the loss is just the function we optimize, but the loss can be an indicator nonetheless. So longer training is not always the answer (but sometimes it is, isn't it fun?). If you have enough time that the number of epochs is not a concern, you may want to spend that time training a model with more parameters instead.
5. Deeper Architecture
In general, a model with more parameters can model your data more accurately. Creating larger models is as simple as adding more layers, but since we most often use pre-trained models, we usually pick one of the (small) variety of flavours they come in. In general, a bigger model has the ability to better capture the real underlying relationships in your data, but also to capture and memorize the specific details of your individual images. It also eats as much GPU memory as Google Chrome eats RAM. So it pays to fiddle with larger models, but smaller batch sizes.
One other thing you can fiddle with in order to fit larger models into GPU RAM is using "less precise" numbers, so-called fp16 or half-precision floats. This often dramatically increases the speed of training for a practically non-existent loss in precision (for most purposes). It is as simple as calling .to_fp16() on your learner.
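If a bigger model does not fit in GPU memory, the usual first knob is the batch size, which you can pass straight to the dataloaders call (a sketch, assuming the pets DataBlock from earlier; bs = 32 is just an example value):

dls = pets.dataloaders(path/"images", bs = 32)  # same pipeline, smaller batches, less GPU memory per step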
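Put together, a deeper architecture trained in half precision looks roughly like this (a sketch along the lines of what the chapter does; the epoch counts are just examples):

from fastai.vision.all import *  # provides vision_learner, resnet50, error_rate

# Deeper pretrained backbone, trained with 16-bit floats to save GPU memory and time
learn = vision_learner(dls, resnet50, metrics = error_rate).to_fp16()
learn.fine_tune(6, freeze_epochs = 3)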
Questionnaire
- Why do we first resize to a large size on the CPU, and then to a smaller size on the GPU? A. This concept is known as presizing. Data augmentation is often applied to the images, and in fastai it is done on the GPU. However, data augmentation can lead to degradation and artifacts, especially at the edges. Therefore, to minimize data destruction, the augmentations are done on a larger image, which is then cropped and resized (RandomResizedCrop) to the final image size.
- Practice some Regex tutorials
- What are the two ways in which data is most commonly provided for most deep learning datasets? A. Individual files containing the data, such as images or video; or tabular data, such as a CSV file.
- Look up the documentation for L and try it out a bit
- Look up the documentation for the Python pathlib module and try it out a bit
- Give two examples of ways that image transformations can degrade the quality of the data? A. 1. Rotation can leave empty areas in the final image if done after scaling down. 2. Other operations may require interpolation, which can degrade the image quality, since in essence you are filling in blanks.
- What method does Fast.ai provide to view the data in a DataLoaders? A. show_batch()
- What method does Fast.ai provide to help you debug a DataBlock? A. You can call .summary() on a DataBlock, which tries to run the DataBlock pipeline and gives you a summary of what it does, where it errors, etc. This is a good way to check whether you built your pipeline correctly.
- Should you hold off on training a model until you have thoroughly cleaned your data? A. No. You should train a model as soon as you have a first rough draft ready. This gives you a good baseline to compare future (possibly more complicated) models to. If you have no baseline, you do not know how good your models are.
- Which two pieces combine into CrossEntropyLoss in PyTorch? A. Cross-entropy loss consists of a softmax() function, to make sure all the activations are between 0 and 1 and sum to 1, and a negative log likelihood, to punish confident mistakes more than non-confident ones.
- What are the two properties that softmax ensures? Why is this important? A. As mentioned above, the softmax ensures all activations are between 0 and 1, and that they all sum up to 1. These are prerequisites for treating the outputs as likelihoods. It also makes the model pick one category among the X categories.
- When might you want your activations to not have these two properties? A. When you are doing something other than single-label classification, for example regression or multi-label classification (where you want the model to possibly pick more than one category).
- Calculate the exp and softmax columns yourself from the table. A. Done
- Why can't we use torch.where to create a loss function for datasets where our label can have more than two categories? A. Since torch.where has the following signature:
where(condition, input, other, *, out=None) -> Tensor
it only selects between two options based on a condition. You could maybe technically nest torch.where calls inside each other, but I doubt that would be very useful.
15. What is the value of log(-2)?
A. The logarithm of a negative number is undefined, and not a real number.
16. What are two good rules of thumb for picking a learning rate from the learning rate finder?
A. Pick a learning rate 1 order of magnitude smaller than where the minimum loss was achieved or the last point where the loss was clearly improving.
17. What two steps does the fine_tune method do?
A. The whole method body of fine_tune is as follows:
self.freeze()
self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
base_lr /= 2
self.unfreeze()
self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
So it freezes the model, trains it, unfreezes it and trains it some more.
18. In Jupyter Notebook, how do you get the source code for a method or function?
A. You put ?? behind the method name. So in the previous question, to get the source code for fine_tune you call learn.fine_tune??
19. What are Discriminative Learning Rates?
A. Discriminative Learning Rates is the concept of using different learning rates for different layers, since different layers learn different things.
20. How is a Python slice object interpreted when passed as a learning rate to Fast.ai?
A. A slice object is interpreted as a range of learning rates, which will then be spread equally over the different layers.
21. Why is early stopping a poor choice when using 1cycle training?
A. If early stopping is used, the training may not have time to reach lower learning rate values in the learning rate schedule, which could easily continue to improve the model. Therefore, it is recommended to retrain the model from scratch and select the number of epochs based on where the previous best results were found.
22. What is the difference between Resnet50 and Resnet101?
A. Resnet101 is a model with a deeper architecture than Resnet50, namely 101 layers vs 50 layers.
23. What does to_fp16() do?
A. It causes PyTorch to use floating-point numbers with less precision (half precision, in fact) in order to save memory and speed up training, while sacrificing some precision.