Chapter 6

June 10, 2025

Fast-AI Fastbook chapter 6, Multi-Category and Regression

So far we have learned to classify images into a single category, and picked up some ways to optimize training and improve our models.

In this chapter we will look at two other types of Computer Vision problems: Multi-Category Classification and Regression. In the process we will learn more about output activations, and more types of loss functions.

Multi-Label Classification

This refers to the problem of identifying the categories of objects in an image when there may be more than one category, or none at all. With single-category classification, the model always outputs something, even if you feed it complete trash. That might not be what we want. On the other hand, an image may contain multiple objects belonging to different categories, and we might want to know about all of them, not just the most prominent one.

The model we will be using is trained on the PASCAL dataset, which comes as a dataframe with fname, labels and is_valid columns. The fname column contains the name of the corresponding image, and the labels column holds the list of labels as a space-delimited string.

Alright, but this data is a dataframe, and a fast.ai learner expects a DataLoaders object. Let’s walk through the steps. Remember:

Dataset is a collection that returns a tuple of your independent and dependent variable for a single item.
DataLoader is an iterator that provides a stream of mini-batches, where each mini-batch is a tuple of a batch of independent variables and a batch of dependent variables.
Datasets is an object that contains a training Dataset and a validation Dataset.
DataLoaders is an object that contains a training DataLoader and a validation DataLoader.
DataBlock is a container to quickly build Datasets and DataLoaders.
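To make the difference between a Dataset and a DataLoader concrete, here is a minimal sketch in plain Python. ToyDataset and toy_dataloader are made-up names for illustration, not fast.ai or PyTorch classes, and the real ones also handle shuffling, tensors and parallel loading.

```python
# Toy stand-ins (hypothetical names, not fastai classes) for a Dataset and a DataLoader.

class ToyDataset:
    """Indexable collection: ds[i] -> (independent, dependent) for a single item."""
    def __init__(self, xs, ys):
        assert len(xs) == len(ys)
        self.xs, self.ys = xs, ys

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, i):
        return self.xs[i], self.ys[i]


def toy_dataloader(ds, batch_size):
    """Yield mini-batches: (batch of independents, batch of dependents)."""
    for start in range(0, len(ds), batch_size):
        stop = min(start + batch_size, len(ds))
        items = [ds[i] for i in range(start, stop)]
        xb = [x for x, _ in items]
        yb = [y for _, y in items]
        yield xb, yb


ds = ToyDataset(["a.jpg", "b.jpg", "c.jpg"], ["car", "dog", "car person"])
batches = list(toy_dataloader(ds, batch_size=2))

print(ds[0])     # one item: a single (x, y) tuple
print(batches)   # a stream of (xb, yb) mini-batches
```

The Dataset answers "give me item i", the DataLoader answers "give me the next batch".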

Using notebooks, it is simple to create these objects, since you can take it one step at a time and check your data along the way. Let’s start with an empty DataBlock and go from there:

dblock = DataBlock()

We can then feed it our dataframe and turn it into a Datasets object:

dsets = dblock.datasets(df)

It automatically creates training and validation datasets for us: dsets.train and dsets.valid. Looking at the first item of the training set:

x, y = dsets.train[0]

returns a tuple of two identical objects (the same dataframe row, twice). That is because we haven’t told the DataBlock what the x and y variables are, so it just uses the whole row for both. We can fix that:

dblock = DataBlock(get_x = lambda r: r['fname'], get_y = lambda r: r['labels'])

dsets = dblock.datasets(df)
dsets.train[0]

now returns ('008663.jpg', 'car person'). Better, but not perfect yet. We still need to convert the independent variable to a complete path in order to get the image, and we need to split the dependent variable on spaces to get a list of categories:

def get_x(r): return path/'train'/r['fname']
def get_y(r): return r['labels'].split(' ')
dblock = DataBlock(get_x = get_x, get_y = get_y)

So now we have a tuple of (path, list of strings), but we need a tuple of (image, multi-category). Luckily DataBlock can help with that:

dblock = DataBlock(blocks = (ImageBlock, MultiCategoryBlock),
get_x = get_x, get_y = get_y)

After rebuilding dsets from this DataBlock, a single training item is returned as

(PILImage mode=RGB size=500x375, TensorMultiCategory([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]))

which is what we want. So why is this one-hot encoded? If we only had integers that represented the position of each label in the vocabulary, every dependent variable would be of a different length. For an image with 3 categories we would have something like [1, 34, 17], and for an image with a single category [1]. That would not work, since all tensors in a batch need to have the same length.
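As a small illustration of why one-hot encoding gives every item the same length, here is a sketch in plain Python. The one_hot helper and the four-word vocabulary are made up for this example; fast.ai builds the real vocabulary (20 entries for PASCAL) from the data.

```python
def one_hot(labels, vocab):
    """Return a fixed-length 0/1 list with a 1 at the position of every label."""
    present = set(labels)
    return [1.0 if name in present else 0.0 for name in vocab]


# Hypothetical four-word vocabulary, just for illustration.
vocab = ["bicycle", "car", "dog", "person"]

two_labels = one_hot("car person".split(" "), vocab)
no_labels = one_hot([], vocab)

print(two_labels)  # same length as vocab, two positions set
print(no_labels)   # same length as vocab, nothing set
```

Note that an image without any label is perfectly representable: it is just a vector of zeros.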

Up until now we have ignored the is_valid column of the dataset. That could be fine (without a splitter, DataBlock falls back to a random split), but there is probably a reason the creators of the dataset included it. So let’s take it into account as well.

def splitter(df):
   train = df.index[~df['is_valid']].tolist()
   valid = df.index[df['is_valid']].tolist()
   return train,valid

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                  splitter=splitter,
                  get_x=get_x, 
                  get_y=get_y)

You would think we were done by now, but if you run dblock.summary(df) on this it fails at the batching step. The images are still of different sizes, so they cannot be stacked into a single tensor. So the last step is a resize:

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=splitter,
                   get_x=get_x, 
                   get_y=get_y,
                   item_tfms = RandomResizedCrop(128, min_scale=0.35))
dls = dblock.dataloaders(df)

And now we are done with preparing our data. We can start training a model now, but let’s dig a bit deeper into the default loss function that fast.ai picks for this kind of problem.

Binary Cross-Entropy

So we know a learner needs four things: the model, the DataLoaders object, an optimizer, and a loss function. Let’s create a learner, and from its activations learn about the loss function:

learn = vision_learner(dls, resnet18)

(vision_learner fills in the optimizer and the loss function automatically. The default optimizer in fast.ai is Adam, and for a MultiCategoryBlock target it picks Binary Cross-Entropy as the loss function.)

Let’s give it a single batch of our independent variable (the images) and see what its activations are (so before any loss calculation is done):

x, y = to_cpu(dls.train.one_batch())
activs = learn.model(x)
activs.shape

returns torch.Size([64, 20]). Why are the activations of this shape? A batch is 64 images and there are 20 categories, so we get one activation per category for each image. Let’s see what those look like:

activs[0]

TensorBase([-1.4608, 0.9895, 0.5279, -1.0224, -1.4174, -0.1778, -0.4821, -0.2561, 0.6638, 0.1715, 2.3625, 4.2209, 1.0515, 4.5342, 0.5485, 1.0585, -0.7959, 2.2770, -1.9935, 1.9646], grad_fn=<AliasBackward0>)

Cool, numbers, looks good. They are not yet between 0 and 1, since that is done later using a scaling function. Last chapter we learned that softmax was the go-to for basically every classification problem. Well, except this one. Remember, this is a multi-category problem. Softmax makes all predictions sum to 1 and tends to push a single activation far above the rest. We do not want that, because several labels (or none) can be correct at the same time. We also can’t use Negative Log-Likelihood the way we did before, since it only looks at the activation of a single target label per item. So we want:

Activations scaled between 0 and 1, but not necessarily summing to 1 -> Sigmoid
Confident mistakes punished harder than non-confident ones -> log
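To see the difference between the two scaling functions, here is a minimal sketch in plain Python with made-up activations (two of them high, roughly like the 4.2 and 4.5 in the output above). The softmax and sigmoid helpers are hand-written for illustration; they are not the PyTorch implementations.

```python
import math


def softmax(acts):
    """Exponentiate and normalise so that the outputs sum to 1."""
    m = max(acts)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in acts]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(a):
    """Squash a single activation into (0, 1), independently of the others."""
    return 1 / (1 + math.exp(-a))


acts = [4.2, 4.5, -1.0]  # made-up activations: two labels look present
soft = softmax(acts)
sig = [sigmoid(a) for a in acts]

print([round(p, 3) for p in soft])  # the two high activations must share the total of 1
print([round(p, 3) for p in sig])   # each label gets its own independent probability
```

With softmax the two likely labels have to compete with each other, while with sigmoid both can be close to 1 at the same time.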

This leads us to the Binary Cross Entropy loss function, which is defined as follows:

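def binary_cross_entropy(inputs, targets):
	# squash the raw activations into (0, 1)
	inputs = inputs.sigmoid()
	# take p where the label is present and 1-p where it is absent, then -log, averaged
	return -torch.where(targets == 1, inputs, 1 - inputs).log().mean()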

And of course there is a PyTorch equivalent already, so we do not have to define this ourselves: nn.BCEWithLogitsLoss does both the sigmoid and the log in one. There is also nn.BCELoss, which calculates Binary Cross-Entropy on a one-hot-encoded target as well, but does not include the sigmoid. That is the one to use when your model already outputs probabilities between 0 and 1.
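A quick worked example in plain Python shows the "punish confident mistakes" behaviour. The two helpers below are hand-written sketches that mirror the split between the two PyTorch losses in spirit only; the real nn.BCEWithLogitsLoss combines the sigmoid and the log in a numerically more stable way. The activations are made up.

```python
import math


def sigmoid(a):
    return 1 / (1 + math.exp(-a))


def bce_from_probs(probs, targets):
    """Inputs are already probabilities (the nn.BCELoss situation)."""
    losses = [-math.log(p if t == 1 else 1 - p) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)


def bce_from_logits(logits, targets):
    """Inputs are raw activations: sigmoid first, then the same loss."""
    return bce_from_probs([sigmoid(a) for a in logits], targets)


targets = [1, 0, 1]
good = bce_from_logits([3.0, -3.0, 2.0], targets)   # confident and right
bad = bce_from_logits([-3.0, 3.0, -2.0], targets)   # confident and wrong

print(round(good, 4), round(bad, 4))
```

The same activations with the signs flipped go from a small loss to a loss that is many times larger, which is the effect of the log.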

So now we have the loss function covered. What about the metric? We can’t just use accuracy, since plain accuracy compares the single highest activation with a single target label, and so only works for single-category problems. So how do we decide what is a 0 and what is a 1 in a multi-category problem? We pick a threshold. Everything above that threshold is a 1, everything below is a 0. This makes our metric definition as follows:

def accuracy_multi(inp, targ, thresh = 0.5, sigmoid = True):
	if sigmoid: inp = inp.sigmoid()
	return ((inp > thresh) == targ.bool()).float().mean()

So now we are finally ready to train the model properly:

learn = vision_learner(dls, resnet50, metrics = partial(accuracy_multi, thresh = 0.2))
learn.fine_tune(3, base_lr=3e-3, freeze_epochs = 4)

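partial comes from Python’s functools module. It takes a function and some arguments, and returns a new function with those arguments already filled in. Here it gives us a version of accuracy_multi with thresh fixed at 0.2.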
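A tiny standalone example of functools.partial, using a made-up is_positive function rather than the real metric:

```python
from functools import partial


def is_positive(prob, thresh=0.5):
    """Hypothetical helper: is this probability above the threshold?"""
    return prob > thresh


# New function with thresh already filled in; prob is still left open.
is_positive_02 = partial(is_positive, thresh=0.2)

default_result = is_positive(0.3)     # compared against the default 0.5
fixed_result = is_positive_02(0.3)    # compared against the fixed 0.2

print(default_result, fixed_result)
```

This is handy for metrics, because the learner calls the metric with only the predictions and targets, so any other setting has to be baked in beforehand.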

Picking the right threshold can be vital to the accuracy of your model. Too low and you will get too many false positives, too high and you will miss categories the model might be predicting correctly. This can be checked by retrieving the predictions of the model and plotting the accuracy for different thresholds (get_preds already applies the sigmoid, which is why sigmoid=False is passed to the metric):

preds, targs = learn.get_preds()

xs = torch.linspace(0.05,0.95,29)
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs,accs);
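The same threshold sweep can be tried by hand on a toy example. This is a plain Python sketch with made-up probabilities and targets (3 images, 3 labels each); accuracy_multi_toy is a hypothetical list-based counterpart of the metric above, not the fast.ai function.

```python
def accuracy_multi_toy(probs, targs, thresh):
    """Fraction of (item, label) cells where (prob > thresh) matches the target."""
    hits = total = 0
    for prow, trow in zip(probs, targs):
        for p, t in zip(prow, trow):
            hits += int((p > thresh) == bool(t))
            total += 1
    return hits / total


# Made-up probabilities (already between 0 and 1) and one-hot targets.
probs = [[0.90, 0.30, 0.05], [0.40, 0.10, 0.70], [0.15, 0.60, 0.35]]
targs = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]

threshs = [0.1, 0.2, 0.5, 0.8]
accs = {t: accuracy_multi_toy(probs, targs, t) for t in threshs}
best = max(accs, key=accs.get)

for t in threshs:
    print(t, round(accs[t], 3))
print("best threshold:", best)
```

In this toy data the lowest threshold lets one false positive through and the higher ones miss real labels, so the accuracy peaks somewhere in between.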

Regression

All right, now on to the fun part. Classifying things is fun, but regression is where the real power lies, since you are no longer limited to a finite number of categories. We are predicting numbers instead! For this example, we are going to predict the center of a person’s face in an image. So the final layer of the model outputs two numbers per image: the (x, y) coordinates.

We are going to skip the data preparation and dive straight into creating the final DataBlock:

biwi = DataBlock(blocks = (ImageBlock, PointBlock),
	get_items = get_image_files,
	get_y = weird_function,  # placeholder name: maps an image file to the face-center point
	splitter = FuncSplitter(lambda o: o.parent.name == '13'),
	batch_tfms = aug_transforms(size = (240, 320)))

So what is this FuncSplitter? We want our validation set to be a person the model has never seen before, in order to properly assess its capabilities. So we pick a person (in this case #13) whose images are taken out of the training set and used for validation. PointBlock tells fast.ai that we are predicting coordinates. Keep in mind that fast.ai applies the same data augmentation to these coordinates as to the input images. You might be required to do this manually with other libraries.
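The idea behind splitting on a function can be sketched in plain Python. split_by_func and the file layout below are made up for illustration (one folder per person is an assumption here); this is not the fast.ai implementation of FuncSplitter.

```python
from pathlib import Path


def split_by_func(items, is_valid):
    """Return (train_idxs, valid_idxs); an item is validation when is_valid(item) is True."""
    train = [i for i, o in enumerate(items) if not is_valid(o)]
    valid = [i for i, o in enumerate(items) if is_valid(o)]
    return train, valid


# Hypothetical file layout: one folder per person.
files = [Path("01/frame_a.jpg"), Path("13/frame_b.jpg"),
         Path("02/frame_c.jpg"), Path("13/frame_d.jpg")]

train_idxs, valid_idxs = split_by_func(files, lambda o: o.parent.name == "13")

print(train_idxs, valid_idxs)
```

Every image in folder 13 ends up in the validation set, so the model is scored on a person it has never trained on.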

Next, let’s look at the shape of the tensors (after building the DataLoaders from biwi) and see if we understand why they are the way they are:

xb, yb = dls.one_batch()
xb.shape, yb.shape

returns (torch.Size([64, 3, 240, 320]), torch.Size([64, 1, 2])). Does this make sense? It does. With a batch size of 64, 3 colour channels (RGB) and 240x320 pixels after the resize, the shape of xb makes sense. The shape of yb also makes sense: the batch size is again 64, and each item is a single point made of 2 floats representing the x and y coordinates.

Let’s train the model:

learn = vision_learner(dls, resnet18, y_range=(-1, 1))

So what does the y_range do? We use it to tell fast.ai the range of our targets. In fast.ai, coordinates are always scaled to be between -1 and +1, so we tell the model that. It is implemented using sigmoid_range, which is defined as follows:

def sigmoid_range(x, lo, hi): return torch.sigmoid(x) * (hi - lo) + lo

The sigmoid scales the value to be between 0 and 1. That is multiplied by the width of the range (hi - lo), and then the lowest value of the range is added.
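To check the arithmetic of that formula, here is a scalar version in plain Python. sigmoid_range_scalar is a made-up name for this sketch; the real function works on tensors.

```python
import math


def sigmoid_range_scalar(x, lo, hi):
    """Scalar sketch of the formula: sigmoid(x) * (hi - lo) + lo."""
    return 1 / (1 + math.exp(-x)) * (hi - lo) + lo


low_end = sigmoid_range_scalar(-10.0, -1, 1)   # very negative activation
middle = sigmoid_range_scalar(0.0, -1, 1)      # sigmoid(0) = 0.5, the middle of the range
high_end = sigmoid_range_scalar(10.0, -1, 1)   # very positive activation

print(low_end, middle, high_end)
```

Whatever the raw activation is, the output stays strictly between lo and hi, with an activation of 0 landing in the middle.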

What is the loss function? Regression loss functions are comparatively easy. Since you are estimating a float, you can just measure how far away the prediction is from the target, usually squaring that distance so that the sign does not matter and large errors count more heavily. Regression type problems often use old-fashioned MSE or RMSE loss functions.
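As a worked example, here are MSE and RMSE computed by hand in plain Python for one made-up predicted point and target point, both in the -1 to 1 coordinate range. The helper names are for this sketch only.

```python
import math


def mse(preds, targs):
    """Mean of the squared differences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targs)) / len(preds)


def rmse(preds, targs):
    """Square root of the MSE, back in the same units as the targets."""
    return math.sqrt(mse(preds, targs))


# One predicted point and one target point, flattened to [x, y] (made-up values).
pred = [0.1, -0.2]
targ = [0.4, 0.2]

mse_value = mse(pred, targ)    # (0.3**2 + 0.4**2) / 2
rmse_value = rmse(pred, targ)

print(round(mse_value, 4), round(rmse_value, 4))
```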

Conclusion

So having seen the most used loss functions, remember this:

For single-label classification -> CrossEntropy Loss using nn.CrossEntropyLoss
For multi-label classification -> Binary CrossEntropy Loss using nn.BCEWithLogitsLoss
For Regression -> (Root) Mean Squared Error using nn.MSELoss (PyTorch has no built-in nn.RMSELoss; take the square root of the MSE if you want RMSE)

Questionnaire

How could multi-label classification improve the usability of the bear classifier? A. Multi-label classification is able to identify multiple categories, but in this case even more important, it can also identify NO category. If you feed a bear classifier an image of something that is not a bear, you want the classifier to notice this.
How do we encode the dependent variable in a multi-label classification problem? A. We encode the variable using one-hot encoding. This ensures the dependent variable tensors are all the same size, which is required.
How do you access the rows and columns of a dataframe as if it was a matrix? A. You can access rows and columns of a DataFrame with the iloc property, as if it were a matrix: df.iloc[0, :]
How do you get a column by name from a dataFrame? A. By indexing the dataframe directly: df['fname']
What is the difference between a Dataset and a DataLoader? A.
- A Dataset is a collection that returns a tuple of your independent and dependent variable for a single item.
- A DataLoader is an iterator that provides a stream of mini-batches, where each mini-batch is a tuple of a batch of independent variables and a batch of dependent variables.
What does a Datasets object normally contain? A. it contains a training Dataset and a validation Dataset
What does a DataLoaders object normally contain? A. It contains a training DataLoader and a validation DataLoader
What does lambda do in Python? A. it allows us to pass one-use anonymous functions in-line into arguments that require functions.
What are the methods to customize how the independent and dependent variables are created with the data block API? A. get_x is used to specify how the independent variables are created. get_y is used to specify how the dependent variables are labelled.
Why is softmax not an appropriate output activation function when using a one hot encoded target? A. softmax makes sure all the activations are scaled between 0 and 1, which is not that bad, but it also scales them all such that they sum up to 1. This makes it so it pushes a single activation to the forefront, something we do not want when using one-hot encoding for multi-label classification.
Why is nll_loss not an appropriate loss function when using a one-hot encoded target? A. Much the same reason. nll_loss only looks at the activation of a single target label per item, which is not what we want when an item can have several labels or none.
What is the difference between nn.BCELoss and nn.BCEWithLogitsLoss? A. The WithLogits part explains it. Logits are the raw activations, before they are mapped to probabilities. nn.BCEWithLogitsLoss applies the sigmoid itself before calculating the Binary Cross-Entropy loss. nn.BCELoss does not, so it expects inputs that are already probabilities between 0 and 1.
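Why can’t we use regular accuracy in a multi-label problem? A. Regular accuracy compares the single highest activation with a single target label, so it is only defined for single-label problems. In a multi-label problem every label needs its own yes/no decision, which is why we use a threshold.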
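When is it okay to tune a hyperparameter on the validation set? A. In essence never, HOWEVER, it can be done if the relationship between the hyperparameter and the metric is a smooth curve (like the threshold plot above), since then you are not just picking a lucky outlier. But be careful with it.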
How is y_range implemented in fastai? A. It is implemented using sigmoid_range: a sigmoid scales the value to be between 0 and 1, after which it is further scaled to the required range by value * (hi - lo) + lo
What is a regression problem? What loss function should you use for such a problem? A. Regression is a problem in which the targets are not categories, but continuous numbers instead. The loss functions then become quite easy, since you can use the MSE or RMSE.
What do you need to make sure the fastai library applies the same data augmentation to your input images and your target point coordinates? A. Within fastai, nothing. It is done automatically IF you use the right DataBlock. For other libraries you might have to do it yourself.