Chapter 7
Fastai Fastbook chapter 7: Training a state-of-the-art model
This chapter introduces some more advanced techniques for training an image classification model. To demonstrate them, we will be using Imagenette, a subset of ImageNet with 10 distinct categories.
Let's first create a simple model that will serve as our baseline:
from fastai.vision.all import *

# Imagenette: a 10-class subset of ImageNet, the dataset used in this chapter
path = untar_data(URLs.IMAGENETTE)

dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(460),
                   batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = dblock.dataloaders(path, bs=64)
Nothing new so far in the data setup.
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
dls.c gives us the number of unique labels in our dataset, in this case 10. And since the xresnet50 models do not come with pre-trained weights here, we are essentially training a model from scratch; we just use a pre-existing architecture.
epoch train_loss valid_loss accuracy time
4 0.585707 0.541810 0.825243 01:03
Let's now consider some key techniques that can be used to further improve this model:
Normalization
When training a model, it helps if your input data is normalized (mean = 0, standard deviation = 1). When all your data roughly lies between -2 and 2, no single feature dominates the others simply because of its scale.
One very concrete reason is that otherwise you would need to tune the weight initialisations in the network according to the input range of each feature (which in practice you don't). For example, let $x_1, x_2$ be two distinct features and $w_1, w_2$ the corresponding weights, with ranges $x_1 \in [0, 1000]$ and $x_2 \in [0, 1]$. If you initialise each $w_i$ with numbers in $[-1, 1]$, the same weight value does not mean the same thing for $x_1$ and $x_2$: the sum $w_1 x_1 + w_2 x_2$ will be dominated by $w_1 x_1$, and you won't see the effect of $w_2 x_2$ for a long time unless you're very lucky. Learning is hindered significantly until the network finally figures out what $w_1$ should have been in the first place.
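Here is a tiny numeric sketch of that dominance effect (the feature values and weights are made up purely for illustration):
import torch

# two hypothetical features with very different ranges
x1 = torch.tensor([250., 900., 40.])   # range roughly [0, 1000]
x2 = torch.tensor([0.3, 0.8, 0.1])     # range roughly [0, 1]
w1, w2 = 0.5, -0.7                     # weights initialised somewhere in [-1, 1]

print(w1*x1 + w2*x2)   # tensor([124.79, 449.44, 19.93]) -> dominated by the x1 term

# after normalizing each feature to mean 0 / std 1, both contribute comparably
x1n, x2n = (x1 - x1.mean())/x1.std(), (x2 - x2.mean())/x2.std()
print(w1*x1n + w2*x2n)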
Fortunately, normalizing the data is easy to do in fastai by adding the Normalize transform. This acts on a whole mini-batch at once, so you can add it to the batch_tfms section of your data block. You need to pass to this transform the mean and standard deviation that you want to use. (If you do not pass any statistics to the Normalize transform, fastai will automatically calculate them from a single batch of your data.)
def get_dls(bs, size):
    dblock = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_items=get_image_files,
                       get_y=parent_label,
                       item_tfms=Resize(460),
                       batch_tfms=[*aug_transforms(size=size, min_scale=0.75),
                                   Normalize.from_stats(*imagenet_stats)])
    return dblock.dataloaders(path, bs=bs)
dls = get_dls(64, 224)
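To sanity-check that the transform does what we expect, you can look at the per-channel statistics of one batch; after Normalize they should be roughly 0 and 1 (the dim=[0, 2, 3] assumes the usual batch x channel x height x width layout):
x, y = dls.one_batch()
# per-channel mean and standard deviation of the normalized batch
print(x.mean(dim=[0, 2, 3]), x.std(dim=[0, 2, 3]))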
After training it again:
epoch train_loss valid_loss accuracy time
4 0.577889 0.550673 0.824496 01:06
Although it did not seem to do much here, normalization becomes very important when using pre-trained models (which should be your default go-to). If a pre-trained model was trained on normalized data and you feed it non-normalized data, the model will be seeing, and thus learning, something completely different from what you intended. That's why the statistics of the input data are often published together with a model.
The reason we didn't have to do this earlier is that we were using pre-trained models through fastai's vision_learner. In that case fastai does the normalization for us.
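As far as I know, recent fastai versions expose this through a normalize argument on vision_learner, which defaults to True and adds the Normalize transform matching the pretrained weights (worth double-checking against your version's docs):
# with a pretrained model, fastai adds the appropriate Normalize for us
learn = vision_learner(dls, resnet50, metrics=accuracy)   # normalize=True is the default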
Progressive Resizing
In the previous chapter, we ultimately trained the model on 224x224 images after a presize to 460. But who says we have to train the model on the same image size the whole time? Yes, all images have to be the same size within an epoch / batch, but in between we can play with the size. This way we can reduce the time we spend training. That's what progressive resizing is: gradually using larger and larger images while we are training. The kinds of things a neural network learns are not tied to image size at all, but to gradients, edges and patterns. In essence, progressive resizing is another form of data augmentation, a concept which generally leads to better generalization of a model, since it has seen more different kinds of images.
We already created the get_dls function above, where we can pass a batch size and an image size and get the corresponding dataloaders. With this function it is as simple as:
dls = get_dls(128, 128)
learn = Learner(dls, xresnet50(n_out=dls.c), loss_func=CrossEntropyLossFlat(),
                metrics=accuracy)
learn.fit_one_cycle(4, 3e-3)
First we train for 4 epochs on 128x128 images. Then we train for 5 more on 224x224 images:
learn.dls = get_dls(64, 224)
learn.fine_tune(5, 1e-3)
Giving us the following results:
epoch train_loss valid_loss accuracy time
4 0.414880 0.431332 0.863331 01:06
An improvement of about 4% accuracy compared to the previous technique, with the added benefit that the first few epochs train much faster, since the images are smaller, yet the model still picks up relevant information from them.
Note that for transfer learning, progressive resizing may actually hurt performance: if you train on smaller images than the pretrained model was originally trained on, you may damage the pretrained weights, since you are essentially doing progressive resizing backwards.
Test-Time Augmentation
So now we come into the domain of somewhat weirder things (in my opinion). One of these is to apply data augmentation to your validation data, and average (or take the max of) the model's predictions over all augmented versions of a single image. This CAN result in dramatic improvements in accuracy, but of course your validation step will take longer than normal. By default, fastai uses the unaugmented center-crop image plus 4 randomly augmented images, and it can be called as:
preds,targs = learn.tta()
accuracy(preds, targs).item()
0.8737863898277283
We get another percent of accuracy.
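If you want more augmented copies, or the maximum instead of the mean, tta takes parameters for that; as far as I know the relevant arguments are n and use_max, but check your fastai version:
# average over 8 augmented versions instead of the default 4;
# set use_max=True to take the element-wise max of the predictions instead of the mean
preds, targs = learn.tta(n=8, use_max=False)
print(accuracy(preds, targs).item())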
Mixup
Mixup is essentially a non-standard data augmentation technique which sounds really simple, and you wonder why no one came up with it sooner. Basically, you smoosh 2 images from your dataset together into a new image: you compute $x \cdot image_1 + (1-x) \cdot image_2$ for some weight $x$ between 0 and 1. You do this for the independent variable, but also for the dependent variable. If the dependent variables are one-hot encoded as
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
and we take x as 0.3, we end up with the following encoded dependent variable:
[0, 0, 0.3, 0, 0, 0, 0, 0.7, 0, 0]
So we create new training images by mixing existing ones, yet the number of categories stays the same; only the labels become soft combinations (otherwise the one-hot encoding would become a mess).
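Here is a minimal sketch of that combination for a single pair of images in plain PyTorch (the helper name and the fixed weight are made up for illustration; fastai's MixUp callback does this for whole batches and samples the weight from a Beta distribution):
import torch
import torch.nn.functional as F

def mixup_pair(img1, img2, y1, y2, n_classes, lam=0.3):
    "Blend two images and their one-hot encoded labels with weight `lam`."
    mixed_img = lam*img1 + (1 - lam)*img2
    y1_oh = F.one_hot(torch.tensor(y1), num_classes=n_classes).float()
    y2_oh = F.one_hot(torch.tensor(y2), num_classes=n_classes).float()
    mixed_lbl = lam*y1_oh + (1 - lam)*y2_oh
    return mixed_img, mixed_lbl

img1, img2 = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
_, mixed_lbl = mixup_pair(img1, img2, y1=2, y2=7, n_classes=10, lam=0.3)
print(mixed_lbl)   # 0.3 at index 2, 0.7 at index 7, zeros elsewhere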
So here is how we train a model using this technique:
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=MixUp())
learn.fit_one_cycle(5, 3e-3)
cbs is a parameter you can pass callbacks to, something we will explore in more detail later. But what actually happens?
- It's going to be harder to train, because it is harder to see what is in each image and the model has to predict 2 labels plus their weights.
- Overfitting becomes less of a problem, since the model keeps encountering different variants of smooshed-up images.
- It requires many more epochs to train, usually 80+.
While it is actually quite straightforward to visualise with images, this technique can be applied to all kinds of data. Also, remember that softmax and sigmoid are functions that approach 1 or 0 but never actually output those values. This means our loss can never be perfect, and with longer training our model will keep pushing its activations towards ever more extreme values.
With Mixup we no longer have that problem, because our labels will only be exactly 1 or 0 if we happen to "mix" with another image of the same class. The rest of the time our labels will be a linear combination, such as the 0.3 and 0.7 we got in the example above.
And 0.3 and 0.7 are values that softmax and sigmoid functions can actually reach.
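You can see that saturating behaviour in a tiny sketch:
import torch
# sigmoid approaches 1 for large inputs but never actually reaches it,
# so a hard target of 1 keeps pulling the activations further up
print(torch.sigmoid(torch.tensor([2.0, 5.0, 10.0])))
# prints roughly tensor([0.8808, 0.9933, 1.0000]) (the last value is only 1.0 after rounding)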
Label smoothing
So for classification, if we dumb it down, we train the model to return a 1 for a single category and a 0 for all other categories. Even if the model returns 0.9999 for the right category, that is not good enough, and it will still try to improve. This tends to lead to overfitting and a lack of generalization in your model. So how could we let our model settle for slightly-less-than-perfect targets, so it generalizes better? To a human, the tensors
[0.02, 0.02, 0.02, 0.02, 0.92] and [0, 0, 0, 0, 1] are pretty much identical: you can be pretty sure the 5th category is the right one. So let's encode our dependent variable like the first tensor instead of all black-and-white like the second one. Just make sure all your labels still add up to 1 :)
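The standard recipe (the one the label-smoothing paper and the book describe) is: with smoothing $\epsilon$ and $N$ classes, the wrong classes get $\epsilon / N$ and the correct class gets $1 - \epsilon + \epsilon / N$. A small sketch of what those smoothed targets look like (the helper is purely illustrative; fastai's LabelSmoothingCrossEntropy handles this inside the loss, so you never build these targets yourself):
import torch

def smooth_labels(y, n_classes, eps=0.1):
    "Turn integer labels into label-smoothed targets."
    targets = torch.full((len(y), n_classes), eps/n_classes)
    targets[torch.arange(len(y)), y] = 1 - eps + eps/n_classes
    return targets

print(smooth_labels(torch.tensor([4]), n_classes=5))
# tensor([[0.0200, 0.0200, 0.0200, 0.0200, 0.9200]])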
To use this with fastai, it's again very simple. You just pass in the relevant loss function:
model = xresnet50(n_out=dls.c)
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropy(),
                metrics=accuracy)
learn.fit_one_cycle(5, 3e-3)
Label smoothing also requires a lot of epochs, similar to Mixup.
Questionnaire:
- What is the difference between ImageNet and Imagenette? When is it better to experiment on one instead of the other? A. Imagenette is a subset of ImageNet, with 10 distinct categories instead of 1000. Its main use is quick experimentation: see if techniques like the ones mentioned above have any effect. If you find they do, it's always nice to extrapolate to the bigger dataset.
- What is normalization? A. Normalization is the scaling of your input data to some standard distribution, usually one where the mean is roughly 0 and the standard deviation is 1. This way outliers have less of an impact, and features with very different ranges before normalization no longer dominate one another.
- Why didn't we have to care about normalization with pre-trained models? A. Since those models come with known statistics, fastai just plugs in the correct Normalize transform using the statistics that belong to the model.
- What is Progressive Resizing? A. Progressive resizing is the process of gradually using larger images during training. If done correctly, little is lost by resizing the images, but it does speed up training, allowing the first few epochs to train much quicker on smaller images than the later epochs.
- Implement Progressive Resizing in your own Project. A. TBD
- What is Test-Time Augmentation? How do you use it in fastai? A. Test-Time Augmentation is the process of applying image augmentation to validation-set images and taking the mean or max of the predictions. You can use this in fastai by calling .tta() on your learner.
- Is using TTA at inference slower or faster than regular inference? Why? A. Using TTA at inference makes it slower than normal, since it uses image transformations to gather an average prediction. fastai uses 5 images for every inference image, making it roughly 5 times slower.
- What is Mixup? How do you use it in fastai? A. Mixup blends two independent variables (e.g. two images) into a new one, and blends their labels with the same weights. Think of two images, a dog and a cat: we create a new image that is 30% dog and 70% cat, with a new label that is 30% dog and 70% cat. You can use this in fastai by passing the MixUp() callback to your learner. It can drastically improve your model, at the cost of very long training times (80+ epochs).
- Why does Mixup prevent the model from being too confident? A. The labels are almost never exactly 0 or 1, and the model keeps seeing newly blended images (created from random pairs of existing images), so it is never pushed towards extreme, over-confident activations.
- Why does training with Mixup for five epochs end up worse than training without Mixup? A. Mixup makes the model harder to train, because it is harder to see what is in each image and the model has to predict 2 labels plus their weights.
- What is the idea behind label smoothing? A. Sigmoid and softmax activations can never exactly reach 0 or 1, yet we encode our dependent variable that way. That means the model will keep pushing for ever more extreme activations in order to reach unreachable targets. By encoding the dependent variable a bit less extremely (instead of [0, 0, 1] you target roughly [0.03, 0.03, 0.93]), the model can actually reach the targets, generalizes better and reaches a good loss before overfitting.
- What problems in your data can label smoothing help with? A. It mainly helps with imperfectly labelled data (which is all data, let's be honest).
- When using label smoothing with five categories, what is the target associated with the index 1? A. With the usual smoothing of $\epsilon = 0.1$ and 5 categories, each wrong index gets $\epsilon / 5 = 0.02$ and the correct index gets $1 - \epsilon + \epsilon / 5 = 0.92$, so instead of [0, 0, 0, 0, 1] you'll smooth it to [0.02, 0.02, 0.02, 0.02, 0.92]. The target associated with index 1 is therefore 0.02 (assuming index 1 is not the correct class).
- What is the first step to take when you want to prototype quick experiments on a new dataset? A. Create a baseline model once you have your data in order. This way you can compare every improvement you try against a very naive model. If a change makes things worse, don't bother with it.