Chapter 1
fast.ai Fastbook, chapter 1: intro
Deep learning is used in many different areas, some of which are:
- Natural Language Processing (NLP). Speech recognition, summarizing documents, etc.
- Computer vision: Image interpretation, face recognition
- Medicine: X-ray image anomaly detection, diagnosing
- Biology: Protein synthesis sequencing
- Image Generation: Removing noise, converting images, generating images from text.
- Recommendation Systems
- Robotics
What is Machine Learning?
Machine learning is, like normal programming, a way to get computers to do a specific task. But instead of minutely telling the computer how to do something step by step, we give it loads of examples of the problem to solve (and their solutions) and let the computer figure out a generic way to solve it by itself.
The basic loop of how a computer does that is captured by the following image:

Going through the various parts of it:
- input: Any function should always have inputs and outputs. This is the data we feed into the function
- weights: These are just another set of variables, that belong to the function itself, instructing how the function operates
- results: Basically the output of the model, as a result of the given inputs and the weights that have acted on that input
- performance: This is different from the results of the model, since performance depends on the context in which the model operates. E.g. a model’s results can be the moves to perform in a chess game, while its performance is measured by whether those moves result in a win, loss, or draw.
- update: If we measure the performance, we can use it to alter the weights of the model, so in a next iteration, its performance improves.
Once a model is trained, the weights will become a permanent part of the model, no longer being adjustable.
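To make this loop concrete, here is a minimal sketch in plain Python. Nothing here is fast.ai; the toy model, performance measure, and update rule are made up purely to show how the parts connect:
import random

def model(inputs, weights):
    # the "function": a trivial weighted sum of the inputs
    return sum(x * w for x, w in zip(inputs, weights))

def performance(result, target):
    # higher is better: negative squared error
    return -(result - target) ** 2

inputs, target = [2.0, 3.0], 13.0   # one toy example with a known answer
weights = [0.0, 0.0]

for step in range(1000):
    # "update": try a small random nudge and keep it only if performance improves
    candidate = [w + random.uniform(-0.1, 0.1) for w in weights]
    if performance(model(inputs, candidate), target) > performance(model(inputs, weights), target):
        weights = candidate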
What is a Neural Network?
A neural network is nothing more than a mathematical function which is extremely flexible depending on its weights. This means we need to care less about the actual form of the model, and can spend more time and energy on the process of training it: finding good inputs and weight assignments. So how do we do that? How do we update the weights of a neural network? Does it matter what the neural network does? Conveniently, there is also an extremely generic way of updating any neural network’s weights: Stochastic Gradient Descent (SGD).
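As a rough sketch of the idea behind SGD (hand-rolled here, not fast.ai’s implementation): compute how the loss changes as each weight changes (the gradient), then step every weight a little in the direction that lowers the loss. Using the same toy weighted-sum model as above:
inputs, target = [2.0, 3.0], 13.0
weights = [0.0, 0.0]
lr = 0.01   # learning rate: how big a step to take each iteration

for step in range(200):
    prediction = sum(x * w for x, w in zip(inputs, weights))
    error = prediction - target
    # loss = error**2, so d(loss)/d(w_i) = 2 * error * x_i
    gradients = [2 * error * x for x in inputs]
    weights = [w - lr * g for w, g in zip(weights, gradients)]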
Since AI and Deep Learning are quite research-heavy fields, a lot of terminology and jargon is used. Some of the more common terms:
- The functional form of the model is called its architecture.
- The weights are called parameters
- The results of the model are called the predictions
- The predictions are calculated from the independent variable, which is the data not including the labels
- The measure of performance is called the loss
- The loss not only depends on the predictions, but also the correct labels, known as targets
All this makes the above-mentioned loop look more like this (but it is identical in terms of content):

Limitations
- A model cannot be created without data
- A model can only learn patterns seen in the input data used to train it
- Just having examples as input data is not enough, that data needs to have labels saying what’s what.
- A model creates predictions, not recommended actions. Interpreting a model’s results still depends on the context in which the model is used.
So how does the basic image recognizer work?
I’m not gonna go into detail about basic Python stuff, but the main meat is in this fast.ai-specific function:
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
Because we feed the model images, we use an ImageDataLoaders. We tell the function how to get the labels from the dataset using the label_func parameter, which accepts a function. In this case is_cat returns true if the first character of the filename is uppercase, which is how this dataset was set up.
We tell fast.ai to hold out 20% of the data with the valid_pct parameter. This data is not used for training, but is used to measure the accuracy of the model.
Finally we define the Transforms. A Transform contains code that is applied automatically during training; they are basically just Python functions. There are 2 kinds:
- item_tfms are applied to each item (an image, in this case)
- batch_tfms are applied to a batch of items at a time, using the GPU
224 pixels is just a standard size for historical reasons. You can increase this at the cost of more compute power. A combined example is sketched below.
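Reusing the path and is_cat from above (and fastai’s aug_transforms(), used here purely for illustration), the two kinds can be combined like this:
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat,
    item_tfms=Resize(224),           # run on each image individually (CPU)
    batch_tfms=aug_transforms())     # run on whole batches at once (GPU)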
So why do we keep 20% of the data separate to validate our model’s performance? In essence, if you train a large enough model for a long enough time, it will eventually memorize the label of every item in the dataset. This is not at all useful, since when deploying a model you want it to perform well on data it is not trained on / has never seen before. A model that is too adapted to the training set is called overfit.

Overfitting is the single most important and challenging issue when training. For all models and for everyone involved.
Hence we keep 20% of the data separate, so we always have data that the model has ’never seen’, letting us validate both its performance and its tendency to overfit.
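Conceptually, valid_pct=0.2 with a fixed seed does something like this hand-rolled split (a sketch, not fastai’s actual code):
import random

files = list(get_image_files(path))
random.Random(42).shuffle(files)            # fixed seed, so the split is the same every run
cut = int(len(files) * 0.8)
train_files, valid_files = files[:cut], files[cut:]   # 80% to learn from, 20% to check against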
Creating a neural network
Alright, so now we have a DataLoaders object. With this we can tell fast.ai to create a Convolutional Neural Network (CNN) using a single line:
learn = vision_learner(dls, resnet34, metrics=error_rate). In this case we use a pretrained architecture called ResNet, with 34 layers. In practice you only really spend much time picking architectures if you are actually doing research in the field.
The last parameter, metrics, specifies which function we want to use to measure the quality of the model’s predictions (remember the performance part at the end of the flow-chart). Metrics can become quite complicated, but many are pre-built into all kinds of frameworks. Since this is a classifier, we can use the bog-standard error rate, which is the percentage of images classified wrongly.
There is however a small but important distinction between loss and metrics. Metrics are for humans to understand; loss is what SGD (or the computer in general) optimizes. Sometimes they are the same, sometimes they are not.
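A tiny illustration of the difference, using plain PyTorch and made-up numbers: the loss is a smooth, differentiable quantity SGD can follow downhill, while the error rate is the human-friendly "fraction classified wrong":
import torch
import torch.nn.functional as F

preds = torch.tensor([[2.0, 0.5], [0.2, 1.5], [0.1, 3.0]])   # raw model outputs for 3 images
targets = torch.tensor([0, 0, 1])                            # true classes

loss = F.cross_entropy(preds, targets)                       # what SGD minimizes
error = (preds.argmax(dim=1) != targets).float().mean()      # what we read: here 1 of 3 wrong
print(loss.item(), error.item())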
Pre-trained vs self-trained
In practice, you should always use a pre-trained model like this ResNet model. They are already trained on loads of data, using loads of someone else’s money. It would be foolish to try and recreate this from scratch on your own little laptop. ResNet, for example, has already been trained on 1.3 million photos. And while those may not be the photos you use, it is already very capable of things like edge detection and gradient and color detection, which are universally needed.
When using a pre-trained model, vision_learner will remove the last layer of that model. That layer is always specially customized to the original task. It then replaces it with one or more layers of randomized weights, ready to train. This part of the model is known as the head.
Using a pre-trained model for a task different to what it was originally trained for is called transfer learning.
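Roughly what that head swap looks like, sketched with plain torchvision rather than fastai’s exact code (fastai’s real head is a bit bigger than a single linear layer, but the idea is the same; the weights argument assumes a recent torchvision):
import torch.nn as nn
from torchvision.models import resnet34

model = resnet34(weights="IMAGENET1K_V1")        # body full of pretrained weights
model.fc = nn.Linear(model.fc.in_features, 2)    # fresh, randomly initialized head: 2 outputs (cat / not cat)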
Fitting the model
Finally we tell fast.ai to fit the model:
learn.fine_tune(1)
In order to fit the model, we need to tell it how many times to look at each image (the number of epochs); in this case 1. But why is it called fine_tune when you are essentially fitting instead? Because we have a pre-trained model, filled to the brim with useful weights, and we don’t want to throw all that away. When calling fine_tune, fast.ai will do the following under the hood (sketched in code after this list):
- Use one epoch to fit just those parts of the model necessary to get the new random head to work with your dataset (adjusting to the number of output categories, etc.).
- Use the number of epochs requested to fit the entire model, updating the weights of the later layers faster than the earlier layers (so the new random head updates faster than the pre-existing edge detection).
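A rough, simplified sketch of those two steps using fastai’s own freeze/unfreeze machinery (the learning-rate values are just illustrative):
learn.freeze()                                     # only the new head's weights are trainable
learn.fit_one_cycle(1)                             # one epoch to get the random head into shape
learn.unfreeze()                                   # now the whole model is trainable
learn.fit_one_cycle(1, lr_max=slice(1e-6, 1e-4))   # discriminative learning rates: small for early layers, larger for later ones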
But how does it all work?
So we have an image recognizer that can distinguish between cats and dogs from just pictures. But how?? What is it actually doing? This can be summarized by a single image, but we will elaborate a bit more to get a better picture. First, the image:

The images with the gray background on the left show the reconstructed weights as pictures, while the section on the right shows the parts of training images that most strongly matched that set of weights. It looks like the model taught itself to detect certain things like sunsets, eyes, and door-ish structures. Going a couple of layers further, we see the following:

Two layers later it has learned to detect car wheels, flower petals and typewriters… So you can see that pre-trained models have a lot of very generic identification potential in their earliest layers, while the later layers become more specific to the individual problem the model was trained on; hence the reinitialization of those last layers with random weights when using a pre-trained model.
Ok, but still… How?
We will get to the nitty-gritty details later. I know, bummer, I like nitty-gritty details as well, but that’s just not how this course is set up.
Alright, so what else can an image recognizer do?
So image recognizers can recognize images. Pretty dull you would think. But think about all the different things that can be represented as images:
- Sounds can be represented as images using spectrograms, turning an image recognizer into a sounds recognizer.
- Time series can be represented as images, using all kinds of fancy transformations
- Mouse movements and clicks can be transformed into images of the path
- Binary data can be transformed into a grayscale image, which can be used for malware analysis. Malware binaries can be insanely complicated, but when turned into images, different malware families show clear visual distinctions.

And while these might feel like hacky work-arounds, they have all been used to beat state-of-the-art models in their respective fields. It is all about the results.
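For example, a minimal sketch of the binary-to-grayscale trick: every byte of a file becomes one pixel value between 0 and 255 (the filename and the width of 256 are arbitrary choices here):
import numpy as np
from PIL import Image

raw = open("some_binary_file", "rb").read()
data = np.frombuffer(raw, dtype=np.uint8).copy()
width = 256
height = len(data) // width
img = Image.fromarray(data[: width * height].reshape(height, width), mode="L")
img.save("binary_as_image.png")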
Cool, so image classification it is?
Well no. As we said in the beginning, Neural Networks are insanely flexible, and can pretty much do anything. This is just one example of a classification problem. Following now are some more examples of where Neural Networks really shine.
Segmentation (or the lost art of not hitting pedestrians)
Segmentation is the official jargon for ‘recognizing the content of every individual pixel’. This helps self-driving cars localize objects in a picture. A small example using fast.ai:
from fastai.vision.all import *
path = untar_data(URLs.CAMVID_TINY)
dls = SegmentationDataLoaders.from_label_func(
    path, bs=8, fnames=get_image_files(path/"images"),
    label_func=lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    codes=np.loadtxt(path/'codes.txt', dtype=str)
)
learn = unet_learner(dls, resnet34)
learn.fine_tune(8)
That is all it takes to create a segmentation learner. As you see, it is practically identical to the image learner from earlier.
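To inspect what it learned, the book shows the predictions next to the ground-truth masks (max_n and figsize are just display options):
learn.show_results(max_n=6, figsize=(7, 8))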
NLP (or the lost art of faking to be a human)
Computers can now effortlessly generate text, translate, analyze comments, etc. Here is the code to train a model to classify the sentiment of movie reviews better than anything that existed in the world just 5 years ago:
from fastai.text.all import *
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)
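Once trained, the classifier can be used directly on new text, as in the book’s example:
learn.predict("I really liked that movie!")   # -> predicted label, its index, and the class probabilities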
Tabular data (or the lost art of finally getting rid of Excel)
Of course it is also possible to create a neural network from tabular data. The only downside is that there are not a lot of pre-trained models for tabular data, so you are gonna have to train your own. This is largely because old-fashioned statistics is already really good at modelling tabular data: random forests and gradient boosting machines, for example, usually outperform neural networks on this kind of data. Tabular data usually has a lot of different data types, missing data, feature correlations, etc., which other techniques are just better at handling.
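For completeness, here is (roughly) the book’s tabular example on the ADULT_SAMPLE income dataset. Note that it uses fit_one_cycle instead of fine_tune, precisely because there is no pre-trained model to start from:
from fastai.tabular.all import *
path = untar_data(URLs.ADULT_SAMPLE)

dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names=['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race'],
    cont_names=['age', 'fnlwgt', 'education-num'],
    procs=[Categorify, FillMissing, Normalize])

learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(3)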
Recommendation systems (or the lost art of getting people to buy more shit)
Finally (for this list of examples at least), Neural Networks are also heavily used for recommendation systems. In essence they try to predict which other movies or products a user might like, based on previous data.
from fastai.collab import *
path = untar_data(URLs.ML_SAMPLE)
dls = CollabDataLoaders.from_csv(path/'ratings.csv')
learn = collab_learner(dls, y_range=(0.5, 5.5))
learn.fine_tune(10)
This is the first time you see the y_range parameter; that is because we are trying to predict a continuous value (a rating between 0.5 and 5.5), not a category.
Validation data and Test data
Alright, so we kept 20% of our data separate for validation after each training round. But is that enough to make sure our model does not overfit?
Not really. Normally you would never pick the right parameters at once, run the training loop, and call it a day. Developing and fine-tuning a model requires many iterations to get right. And while in each individual training session the validation set is only used for performance evaluation, we as the modelers use that evaluation to tune the model. This causes the model to be indirectly influenced by the data in the validation set. The solution is to reserve another set of data, the test set. This data is not used in the normal iterative process of refinement, not even by us. Instead, it is only used at the very end to evaluate the model.
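A hand-rolled sketch of the resulting three-way split (the 60/20/20 proportions are just an example):
import random

items = list(range(1000))                  # stand-in for your dataset's items
random.Random(42).shuffle(items)
n = len(items)
train = items[: int(0.6 * n)]              # used to update the weights
valid = items[int(0.6 * n): int(0.8 * n)]  # checked after each experiment, used to tune the model
test  = items[int(0.8 * n):]               # touched only once, at the very end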
Picking the right test set.
It is important that the data in the test set is representative of the real-life application in which the model will be used. Splitting a time-series dataset randomly into train/test, for example, would be misleading, since in production the model will have to predict future events, not randomly scattered past ones. It would be better to cut off the last period of your data and use that as the validation set.
In essence, a test set should contain data that is representative of the real-world application your model is gonna be used in, as well as data that your model has never seen before.
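For a time series that might look like this sketch (made-up DataFrame; the point is splitting on time rather than at random):
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2020-01-01", periods=100), "value": range(100)})
df = df.sort_values("date")
cut = int(len(df) * 0.8)
train_df = df.iloc[:cut]    # the earlier 80% of the data
valid_df = df.iloc[cut:]    # the most recent 20%: what the model must effectively "predict"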
Questionnaire
- Dumb question
- Name five areas where deep learning is now the best in the world
- Medicine
- Scanned image disease detection
- Biology: protein synthesis development
- Facial recognition
- Natural Language Processing (NLP): translation / sentiment analysis
- What was the name of the first device that was based on the principle of the artificial neuron? A. The Mark I Perceptron.
- Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)? A. 1. A set of processing units 2. A state of activation 3. An output function for each unit 4. A pattern of connectivity among units 5. A propagation rule for propagating patterns of activities through the network of connectivities 6. An activation rule for combining the inputs 7. A learning rule 8. An environment within which the system must operate
- What were the two theoretical misunderstandings that held back the field of neural networks? A. Single layer neural networks being the norm and a single layer being unable to perform certain (XOR) operations.
- What is a GPU? A. A GPU is a Compute Resource which excels in performing many parallel simple tasks, such as calculations that need to be done on a per pixel basis to render video. This also makes it extremely useful for Deep Learning, since that requires a lot of parallel, relatively simple, operations.
- Open a notebook and execute a cell containing 1 + 1. What happens? A. It outputs 2.
- Follow through each cell of the stripped version of the notebook for this chapter. Before executing, guess what will happen. A. DONE
- Complete the Jupyter Notebook online appendix A. DONE
- Why is it hard to use a traditional computer program to recognize images in a photo? A. A traditional computer program is being told what to do at every level. It does not mimic the way our brain works at all. Our brain is exceptionally good at tasks like recognition, so it would seem that a computer program that mimics that behaviour would be good at the same tasks as well.
- What did Samuel mean by “weight assignment”? A. weight assignment means assigning values to the weights and updating them after each iteration as needed.
- What term do we normally use in deep learning for what Samuel called “weights”? A. weights are also often called parameters
- Draw a picture that summarizes Samuel’s view of a machine learning model.
A.

- Why is it hard to understand why a deep learning model makes a particular prediction? A. Deep learning models kind of operate as black boxes. Input goes in, output comes out. But whatever happens inside is often thought of as insanely complicated, or just plain magic. While actually it is not that complicated once you get the hang of it.
- What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy? A. The theorem is called the Universal Approximation Theorem
- What do you need in order to train a model? A. A question that needs answering, and data. In some cases not even a lot of data.
- How could a feedback loop impact the rollout of a predictive policing model? A. A predictive policing model is created based on where arrests have been made in the past. In practice, this is not actually predicting crime, but rather predicting arrests, and is therefore partially simply reflecting biases in existing policing processes.
- Do we always have to use 224×224-pixel images with the cat recognition model? A. Nope, 224 x 224 is just a historical artifact. Images can be whatever size you want. However, the bigger your images, the more processing power needed. So it will always be a bit of a trade-off.
- What is the difference between classification and regression? A. Classification is a categorical problem: the model tries to predict a value out of a fixed set of possible values. Regression predicts a continuous numeric value, such as a float within some range.
- What is a validation set? What is a test set? Why do we need them? A. A validation set is used to validate a single run of model training. It is data which the model has not used for training, so it cannot have used it in updating the weights. A test set is an even more obscured set of data, which gets only used all the way at the end, after tweaking and iterating over the model. They are used to test the model’s ability to generalise.
- What will fastai do if you don’t provide a validation set? A. Fastai will automatically create a validation dataset. It will randomly take 20% of the data and assign it as the validation set ( valid_pct = 0.2 ).
- Can we always use a random sample for a validation set? Why or why not? A. You can always, but you probably should not. A validation set should very closely represent the data the model will be seeing after being deployed. So for a time-series model, you should use the most recent data as a validation set. For a human recognition task, you should add humans the model has not seen at all in the validation set, etc.
- What is overfitting? Provide an example. A. Overfitting is the single greatest challenge in this field. It happens when a model is trained too much. It then starts to fit so much to the data it has seen that it loses its ability to generalise, which can lead to downright useless models. E.g. a cat/dog classifier that memorizes its training images and then misclassifies any new photo.
- What is a metric? How does it differ from “loss”? A. They are roughly the same, as they are both indicators for the performance of the model. However, the distinction is that metrics are used by people, loss is used by the computer / SGD algorithm. Some loss functions are barely legible, let alone understandable for humans, but are highly optimized for a computer to minimize.
- How can pretrained models help? A. Pre-trained models are tremendously helpful, as they are usually trained with enormous data-sets for a long time. This makes them extremely useful to detect very general things, like edges, colors or gradients.
- What is the “head” of a model? A. The head of the model is the final layer (or layers): the part that translates the inner workings of the model into the final predictions. When using a pre-trained model, this part is replaced with a new layer specific to the problem at hand.
- What kinds of features do the early layers of a CNN find? How about the later layers? A. Early layers usually find very generic features, such as edges, colors, gradients, and the like. These combine in later layers, where the detection becomes more and more specific.
- Are image models only useful for photos? A. No, because then they would be called “photo models”. A LOT of data can be constructed as images. Sound, binary, etc can all be converted to images with a bit of creativity.
- What is an “architecture”? A. The architecture defines the mathematical model you are trying to fit.
- What is segmentation? A. Segmentation is a pixelwise classification problem. By predicting a class for each pixel, you create polygons of pixels with the same class. This is used in object detection and the likes.
- What is y_range used for? When do we need it? A. y_range is used to define the bounds of a regression output. Regression predicts a value instead of a category, so in theory it could go from -Inf to +Inf; y_range limits/scales the outcome to a certain value range.
- What are “hyperparameters”? A. Hyperparameters are parameters about parameters: they control HOW the model is trained (things like the learning rate or the number of epochs), rather than being learned by the model itself.
- What’s the best way to avoid failures when using AI in an organization?
A. Key things to consider when using AI in an organization:
- Make sure training, validation, and test sets are defined properly in order to evaluate the model in an appropriate manner.
- Try out a simple baseline, which future models should hopefully beat.