Introduction to Computer Vision with PyTorch
This is my personal notes from the course: "PyTorch Fundamentals", taught freely by Microsoft. [Link]
I. INTRODUCTION TO IMAGE DATA:
In computer vision, we normally solve one of the following problems:
Image Classification is the simplest task, in which we need to classify an image into one of many pre-defined categories, for example, distinguishing a cat from a dog in a photograph, or recognizing a handwritten digit.
Object Detection is a somewhat harder task, in which we need to find known objects in the picture and localize them, i.e. return a bounding box for each recognized object.
Segmentation is similar to object detection, but instead of giving a bounding box we need to return an exact pixel map outlining each of the recognized objects.
1) Images as Tensors:
Computer Vision works with Images. As you probably know, images consist of pixels, so they can be thought of as a rectangular collection (array) of pixels.
In the first part of this tutorial, we will deal with handwritten digit recognition. We will use the MNIST dataset, which consists of grayscale images of handwritten digits, 28x28 pixels. Each image can be represented as a 28x28 array, and the elements of this array denote the intensity of the corresponding pixel - either in the range 0 to 1 (in which case floating point numbers are used), or 0 to 255 (integers). A popular python library called numpy is often used with computer vision tasks, because it allows us to operate on multidimensional arrays efficiently.
To deal with color images, we need some way to represent colors. In most cases, we represent each pixel by 3 intensity values, corresponding to the Red (R), Green (G) and Blue (B) components. This color encoding is called RGB, and thus a color image of size W×H will be represented as an array of size 3×H×W (sometimes the order of components might be different, but the idea is the same).
Multi-dimensional arrays are also called tensors. Using tensors to represent images also has an advantage, because we can use an extra dimension to store a sequence of images. For example, to represent a video fragment consisting of 200 frames with 800x600 resolution, we may use a tensor of size 200x3x600x800.
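For example, here is a quick sketch of such tensors in PyTorch (the values are random; only the shapes matter):
import torch

img = torch.rand(1,28,28)          # a grayscale MNIST-sized image: 1 channel, 28x28 pixels
color_img = torch.rand(3,600,800)  # a color image of width 800 and height 600: 3 RGB channels
video = torch.rand(200,3,600,800)  # a 200-frame video fragment at the same resolution
print(img.shape, color_img.shape, video.shape)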
2) Import packages and load the MNIST Dataset:
#Import the packages needed.
import torch
import torchvision
import matplotlib.pyplot as plt
import numpy as np
PyTorch has a number of datasets available right from the library. Here we are using the well-known MNIST dataset of handwritten digits, available through torchvision.datasets.MNIST in PyTorch. The dataset object returns the data in the form of Python Imaging Library (PIL) images, which we convert to tensors by passing a transform=ToTensor() parameter.
from torchvision.transforms import ToTensor
data_train = torchvision.datasets.MNIST('./data',
        download=True,train=True,transform=ToTensor())
data_test = torchvision.datasets.MNIST('./data',
        download=True,train=False,transform=ToTensor())
3) Visualize the digits dataset:
Now that we have downloaded the dataset we can visualize some of the digits:
fig,ax = plt.subplots(1,7)
for i in range(7):
    ax[i].imshow(data_train[i][0].view(28,28))
    ax[i].set_title(data_train[i][1])
    ax[i].axis('off')
4) Dataset structure:
We have a total of 60,000 training images and 10,000 test images. It's important to split out the data for training and testing. We also want to do some data exploration to get a better idea of what our data looks like.
Each sample is a tuple in the following structure:
The first element is a tensor of size 1x28x28
The second element is a label that specifies which digit is represented by the tensor
print('Training samples:',len(data_train))
print('Test samples:',len(data_test))
print('Tensor size:',data_train[0][0].size())
print('First 10 digits are:', [data_train[i][1] for i in range(10)])
All pixel intensities of the images are represented by floating-point values in between 0 and 1:
print('Min intensity value: ',data_train[0][0].min().item())
print('Max intensity value: ',data_train[0][0].max().item())
If you are planning to load your own images, it is important to make sure that all values are scaled to the range 0 - 1 before you start training a neural network.
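For instance, here is a minimal sketch of such scaling, assuming your image arrives as a numpy array of 8-bit integers (the random array below is just a stand-in for real image data):
import numpy as np
import torch

raw = np.random.randint(0, 256, size=(28,28), dtype=np.uint8)  # stand-in for a real 0-255 image
scaled = torch.from_numpy(raw).float() / 255.0                 # now in the range 0-1
print(scaled.min().item(), scaled.max().item())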
II. TRAINING A DENSE NEURAL NETWORK:
Handwritten digit recognition is a classification problem. We will start with the simplest possible approach to image classification - a fully-connected neural network (which is also called a perceptron). We use the pytorchcv helper to load all the data we have talked about in the previous unit.
import torch
import torch.nn as nn
import torchvision
import matplotlib.pyplot as plt
import pytorchcv
pytorchcv.load_mnist()
1) Fully-connected dense neural networks:
A basic neural network in PyTorch consists of a number of layers. The simplest network would include just one fully-connected layer, called a Linear layer, with 784 inputs (one input for each pixel of the input image) and 10 outputs (one output for each class).
As we discussed above, the dimension of our digit images is 1×28×28. Because the input dimension of a fully-connected layer is 784, we need to insert another layer into the network, called Flatten, to change the tensor shape from 1×28×28 to 784.
We want the n-th output of the network to return the probability of the input digit being equal to n. Because the output of a fully-connected layer is not normalized to be between 0 and 1, it cannot be thought of as a probability. To turn it into a probability we need to apply another layer called Softmax.
In PyTorch, it is easier to use the LogSoftmax function, which also computes the logarithms of the output probabilities. To turn the output vector into actual probabilities, we need to take torch.exp of the output.
It can be defined in PyTorch in the following way, using Sequential syntax:
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784,10),    # 784 inputs, 10 outputs
    nn.LogSoftmax(dim=1)) # dim=1 normalizes over the 10 class outputs
2) Training the network:
A network defined this way can take any digit as input and produce a vector of probabilities as an output. Let's see how this network performs by giving it a digit from our dataset:
print('Digit to be predicted: ',data_train[0][1])
torch.exp(net(data_train[0][0]))
As you can see, the network predicts similar probabilities for each digit. This is because it has not yet been trained to recognize digits. We need to train it on our dataset first.
To train the model we will need to create batches of our datasets of a certain size, let's say 64. PyTorch has an object called DataLoader that can create batches of our data for us automatically:
train_loader = torch.utils.data.DataLoader(data_train,batch_size=64)
test_loader = torch.utils.data.DataLoader(data_test,batch_size=64) # we can use larger batch size for testing
The training process steps are as follows:
We take a minibatch from the input dataset, which consists of input data (features) and expected result (label).
We calculate the predicted result for this minibatch.
The difference between this result and the expected result is calculated using a special function called the loss function.
We calculate the gradients of this loss function with respect to model weights (parameters), which are then used to adjust the weights to optimize the performance of the network. The amount of adjustment is controlled by a parameter called learning rate, and the details of optimization algorithm are defined in the optimizer object.
We repeat those steps until the whole dataset is processed. One complete pass through the dataset is called an epoch.
Here is a function that performs one epoch training:
def train_epoch(net, dataloader, lr=0.01, optimizer=None, loss_fn=nn.NLLLoss()):
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    net.train()
    total_loss,acc,count = 0,0,0
    for features,labels in dataloader:
        optimizer.zero_grad()
        out = net(features)
        loss = loss_fn(out,labels) # NLLLoss on log-probabilities; cross_entropy(out,labels) would be used on raw logits
        loss.backward()
        optimizer.step()
        total_loss += loss
        _,predicted = torch.max(out,1)
        acc += (predicted==labels).sum()
        count += len(labels)
    return total_loss.item()/count, acc.item()/count
train_epoch(net,train_loader)
Since this function is pretty generic we will be able to use it later in our other examples. The function takes the following parameters:
Neural network
DataLoader, which defines the data to train on
Loss Function, which is a function that measures the difference between the expected result and the one produced by the network. In most of the classification tasks NLLLoss is used, so we will make it a default.
Optimizer, which defines the optimization algorithm. The most traditional algorithm is stochastic gradient descent, but we will use a more advanced version called Adam by default.
Learning rate defines the speed at which the network learns. During learning, we show the same data multiple times, and each time weights are adjusted. If the learning rate is too high, new values will overwrite the knowledge from the old ones, and the network would perform badly. If the learning rate is too small it results in a very slow learning process.
Here is what we do when training:
Switch the network to training mode (net.train())
Go over all batches in the dataset, and for each batch do the following:
+ compute predictions made by the network on this batch (out)
+ compute loss, which is the discrepancy between predicted and expected values
+ try to minimize the loss by adjusting weights of the network (optimizer.step())
+ compute the number of correctly predicted cases (accuracy)
The function calculates and returns the average loss per data item, and training accuracy (percentage of cases guessed correctly). By observing this loss during training we can see whether the network is improving and learning from the data provided.
It is also important to monitor the accuracy on the test dataset (also called validation accuracy). A good neural network with a lot of parameters can predict with decent accuracy on any training dataset, but it may generalize poorly to other data. That's why in most cases we set aside part of our data, and then periodically check how well the model performs on it. Here is the function to evaluate the network on the test dataset:
def validate(net, dataloader, loss_fn=nn.NLLLoss()):
    net.eval()
    count,acc,loss = 0,0,0
    with torch.no_grad():
        for features,labels in dataloader:
            out = net(features)
            loss += loss_fn(out,labels)
            pred = torch.max(out,1)[1]
            acc += (pred==labels).sum()
            count += len(labels)
    return loss.item()/count, acc.item()/count
validate(net,test_loader)
We train the model for several epochs while observing training and validation accuracy. If training accuracy increases while validation accuracy decreases, that is an indication of overfitting: the model does well on your training data but not on new data.
Below is the training function that can be used to perform both training and validation. It prints the training and validation accuracy for each epoch, and also returns the history that can be used to plot the loss and accuracy on the graph.
def train(net,train_loader,test_loader,optimizer=None,lr=0.01,epochs=10,loss_fn=nn.NLLLoss()):
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    res = { 'train_loss' : [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
    for ep in range(epochs):
        tl,ta = train_epoch(net,train_loader,optimizer=optimizer,lr=lr,loss_fn=loss_fn)
        vl,va = validate(net,test_loader,loss_fn=loss_fn)
        print(f"Epoch {ep:2}, Train acc={ta:.3f}, Val acc={va:.3f}, Train loss={tl:.3f}, Val loss={vl:.3f}")
        res['train_loss'].append(tl)
        res['train_acc'].append(ta)
        res['val_loss'].append(vl)
        res['val_acc'].append(va)
    return res
# Re-initialize the network to start from scratch
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784,10),    # 784 inputs, 10 outputs
    nn.LogSoftmax(dim=1))
hist = train(net,train_loader,test_loader,epochs=5)
This function logs messages with the accuracy on training and validation data from each epoch. It also returns this data as a dictionary (called history). We can then visualize this data to better understand our model training.
plt.figure(figsize=(15,5))
plt.subplot(121)
plt.plot(hist['train_acc'], label='Training acc')
plt.plot(hist['val_acc'], label='Validation acc')
plt.legend()
plt.subplot(122)
plt.plot(hist['train_loss'], label='Training loss')
plt.plot(hist['val_loss'], label='Validation loss')
plt.legend()
The diagram on the left shows the training accuracy increasing (which corresponds to the network learning to classify our training data better and better), while validation accuracy starts to fall. The diagram on the right shows the training loss and validation loss: the training loss decreases (meaning the network performs better on the training data) while the validation loss increases (meaning it performs worse on unseen data). These graphs indicate that the model is overfitting.
3) Visualizing network weights:
Now let's visualize the weights of our neural network and see what they look like. When the network is more complex than just one layer, it can be difficult to visualize the weights like this. However, in our case (classification of a digit), classification happens by multiplying the initial image by a weight matrix, which allows us to visualize the network weights with a bit of added logic.
Let's look at the weight tensor, which for nn.Linear(784,10) has dimensions 10x784 (one 784-element row per output class). It can be obtained by calling the net.parameters() method. In this example, if we want to see whether our number is a 0 or not, we multiply the input digit by weight_tensor[0] and pass the result through a softmax normalization to get the answer. This results in the weight tensor elements somewhat resembling the average shape of the digit it classifies:
weight_tensor = next(net.parameters())
fig,ax = plt.subplots(1,10,figsize=(15,4))
for i,x in enumerate(weight_tensor):
    ax[i].imshow(x.view(28,28).detach())
4) Takeaway:
Training a neural network in PyTorch can be programmed with a training loop. It may seem like a complicated process, but in practice we write it once and can then re-use this training code later without changing it.
We can see that a single-layer dense neural network shows relatively good performance, but we definitely want to get higher than 91% accuracy! In the next unit, we will try using multi-layer perceptrons.
III. TRAINING A MULTI-LAYER PERCEPTRON:
In the previous unit, we used a one-layer dense neural network to classify handwritten digits. Now let's see if adding more layers gives us better performance in terms of accuracy.
As in the previous unit, we will use pytorchcv helper to load all functions that we have defined in the previous units.
import torch
import torch.nn as nn
import torchvision
import matplotlib.pyplot as plt
from torchinfo import summary
from pytorchcv import load_mnist, train, plot_results
load_mnist(batch_size=128)
1) Multi-layer perceptron:
In a multi-layer network, we will add one or more hidden layers.
This layer may contain any number of neurons, which affects how powerful our neural network is, i.e. how many parameters it has. The more parameters there are in the network, the more data we need to train it.
However, more is not always better. A number of parameters of a neural network should be chosen depending on the dataset size, to prevent overfitting.
An important thing to note here is the non-linear activation function layer, called ReLU. It is important to introduce those non-linear activation functions, because they are one of the reasons neural networks achieve high expressive power. Indeed, it can be demonstrated mathematically that if a network consisted just of a series of linear layers, it would essentially be equivalent to one linear layer. Thus inserting non-linear functions in between layers is important!
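To make this concrete, here is a minimal sketch (not part of the course code) showing that two stacked linear layers with no activation in between collapse into a single linear layer whose weight matrix is the product of the two:
import torch
import torch.nn as nn

l1 = nn.Linear(784,100,bias=False)
l2 = nn.Linear(100,10,bias=False)
combined = nn.Linear(784,10,bias=False)
with torch.no_grad():
    combined.weight.copy_(l2.weight @ l1.weight)  # compose the two weight matrices
x = torch.rand(8,784)
print(torch.allclose(l2(l1(x)), combined(x), atol=1e-5))  # True
The code below plots two common non-linear activation functions, ReLU and Sigmoid: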
def plot_function(f,name=''):
    plt.plot(range(-10,10), [f(torch.tensor(x,dtype=torch.float32)).item() for x in range(-10,10)])
    plt.title(name)

plt.subplot(121)
plot_function(torch.relu,'ReLU')
plt.subplot(122)
plot_function(torch.sigmoid,'Sigmoid')
Our network can be defined in PyTorch in the following way, using Sequential syntax:
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784,100),   # 784 inputs, 100 outputs
    nn.ReLU(),            # activation function
    nn.Linear(100,10),    # 100 inputs, 10 outputs
    nn.LogSoftmax(dim=1)) # dim=1 normalizes over classes, not over the batch
summary(net,input_size=(1,28,28))
Here we use the summary() function (imported from torchinfo) to display a detailed layer-by-layer structure of the network with some other useful information. In particular, we can see the number of parameters of the network.
Let's train this multi-layered perceptron:
hist = train(net,train_loader,test_loader, epochs=5)
plot_results(hist)
Please note the following:
This network is more expressive than the one-layer perceptron we trained in the previous unit, so it achieves a much higher training accuracy and, given a sufficiently large number of parameters, it can get to almost 100%.
Once the validation accuracy stops increasing, it means that the model has reached its ability to generalize, and further training is likely to result in overfitting.
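A common response to this (a sketch, not part of the course code, reusing the train_epoch and validate functions from the previous unit) is early stopping: remember the weights from the epoch with the best validation accuracy and stop once it no longer improves. The patience value and checkpoint file name below are illustrative:
best_acc, patience, bad_epochs = 0.0, 2, 0
for ep in range(20):
    train_epoch(net, train_loader)
    _, va = validate(net, test_loader)
    if va > best_acc:
        best_acc, bad_epochs = va, 0
        torch.save(net.state_dict(), 'best_weights.pth')  # checkpoint the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
net.load_state_dict(torch.load('best_weights.pth'))       # roll back to the best epoch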
2) Class-based network definitions:
Defining models using a Sequential style as a list of layers seems very convenient, but it is somewhat limited. At some point you may need to define more complex networks that contain shared weights or non-linear connections between layers.
from torch.nn.functional import relu, log_softmax
class MyNet(nn.Module):
    def __init__(self):
        super(MyNet, self).__init__()
        self.flatten = nn.Flatten()
        self.hidden = nn.Linear(784,100)
        self.out = nn.Linear(100,10)

    def forward(self, x):
        x = self.flatten(x)
        x = self.hidden(x)
        x = relu(x)
        x = self.out(x)
        x = log_softmax(x,dim=1) # dim=1 normalizes over classes
        return x
net = MyNet()
summary(net,input_size=(1,28,28))
You can see that the structure of a neural network is the same as with the Sequential-defined network, but the definition is more explicit. Our custom neural network is represented by a class inherited from torch.nn.Module class.
Class definition consists of two parts:
In the constructor (__init__) we define all layers that our network will have. Those layers are stored as internal variables of the class, and PyTorch will automatically know that parameters of those layers should be optimized when training. Internally, PyTorch uses parameters() method to look for all trainable parameters, and nn.Module will automatically collect all trainable parameters from all sub-modules.
We define the forward method that does the forward pass computation of our neural network. In our case, we start with the input tensor x and explicitly pass it through all the layers and activation functions, from flatten up to the final linear layer out. When we apply our neural network to some input data x by writing out = net(x), the forward method is called.
In fact, Sequential networks are represented in a very similar manner; they just store a list of layers and apply them sequentially during the forward pass. Here we have a chance to represent this process more explicitly, which eventually gives us more flexibility. That is one of the reasons that using classes for neural network definition is the recommended and preferred practice.
Now we will train our network and make sure we get similar results as before:
hist = train(net,train_loader,test_loader,epochs=5)
plot_results(hist)
3) Takeaway:
Weakness of the perceptron: multi-layer networks can achieve higher accuracy than a single-layer perceptron; however, they are not ideal for computer vision tasks. In images, there are structural patterns that can help us classify an object regardless of its position in the image, but perceptrons do not allow us to extract those patterns and look for them selectively.
In the next unit we will focus on a special type of neural network that can be used effectively for computer vision tasks.
IV. USE A CONVOLUTIONAL NEURAL NETWORK:
1) Convolutional neural networks:
In the previous unit we have learned how to define a multi-layered neural network using class definition, but those networks were generic, and not specialized for computer vision tasks.
In this unit we will learn about Convolutional Neural Networks (CNNs), which are specifically designed for computer vision.
Computer vision is different from generic classification, because when we are trying to find a certain object in a picture, we are scanning the image looking for some specific patterns and their combinations. For example, when looking for a cat, we may first look for horizontal lines that can form whiskers, and then a certain combination of whiskers can tell us that it is actually a picture of a cat. The relative position and presence of certain patterns are important, and not their exact position in the image.
To extract patterns, we will use the notion of convolutional filters. But first, let us load all dependencies and functions that we have defined in the previous units.
import torch
import torch.nn as nn
import torchvision
import matplotlib.pyplot as plt
from torchinfo import summary
import numpy as np
from pytorchcv import load_mnist, train, plot_results, plot_convolution, display_dataset
load_mnist(batch_size=128)
2) Convolutional filters:
Convolutional filters are small windows that run over each pixel of the image and compute a weighted average of the neighboring pixels.
They are defined by matrices of weight coefficients. Let's see the examples of applying two different convolutional filters over our MNIST handwritten digits:
plot_convolution(torch.tensor([[-1.,0.,1.],[-1.,0.,1.],[-1.,0.,1.]]),'Vertical edge filter')
plot_convolution(torch.tensor([[-1.,-1.,-1.],[0.,0.,0.],[1.,1.,1.]]),'Horizontal edge filter')
First filter is called a vertical edge filter, and it is defined by the following matrix:
-1 0 1
-1 0 1
-1 0 1
When this filter goes over a relatively uniform pixel field, all values add up to 0. However, when it encounters a vertical edge in the image, a high spike value is generated. That's why in the images above you can see vertical edges represented by high and low values, while horizontal edges are averaged out.
The opposite happens when we apply a horizontal edge filter: horizontal lines are amplified, and vertical lines are averaged out.
In classical computer vision, multiple filters were applied to the image to generate features, which then were used by machine learning algorithm to build a classifier. However, in deep learning we construct networks that learn the best convolutional filters to solve classification problem.
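As a sketch of what happens under the hood when a filter is applied (assuming the data_train dataset loaded by load_mnist is available), a convolution can be computed manually with torch.nn.functional.conv2d:
import torch.nn.functional as F

img = data_train[0][0].unsqueeze(0)          # shape 1x1x28x28: batch, channels, height, width
kernel = torch.tensor([[-1.,0.,1.],[-1.,0.,1.],[-1.,0.,1.]])
edges = F.conv2d(img, kernel.view(1,1,3,3))  # shape 1x1x26x26 (28-3+1=26 positions per axis)
plt.imshow(edges[0,0])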
3) Convolutional layers:
Convolutional layers are defined using nn.Conv2d construction. We need to specify the following:
in_channels - number of input channels. In our case we are dealing with a grayscale image, thus number of input channels is 1.
out_channels - number of filters to use. We will use 9 different filters, which will give the network plenty of opportunities to explore which filters work best for our scenario.
kernel_size - the size of the sliding window. Usually 3x3 or 5x5 filters are used.
The simplest CNN contains one convolutional layer. Given the input size 28x28, after applying nine 5x5 filters we end up with a 9x24x24 tensor (the spatial size is smaller because there are only 28 - 5 + 1 = 24 positions where a sliding window of size 5 fits into 28 pixels).
After convolution, we flatten the 9x24x24 tensor into one vector of size 9 x 24 x 24 = 5184, and then add a linear layer to produce 10 classes. We also use the relu activation function in between layers.
class OneConv(nn.Module):
    def __init__(self):
        super(OneConv, self).__init__()
        self.conv = nn.Conv2d(in_channels=1,out_channels=9,kernel_size=(5,5))
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(5184,10)

    def forward(self, x):
        x = nn.functional.relu(self.conv(x))
        x = self.flatten(x)
        x = nn.functional.log_softmax(self.fc(x),dim=1)
        return x
net = OneConv()
summary(net,input_size=(1,1,28,28))
You can see that this network contains around 50k trainable parameters, compared to around 80k in the fully-connected multi-layer network. This allows us to achieve good results even on smaller datasets, because convolutional networks generalize much better.
hist = train(net,train_loader,test_loader,epochs=5)
plot_results(hist)
As you can see, we are able to achieve higher accuracy, and much faster, compared to the fully-connected networks from previous unit.
We can also visualize the weights of our trained convolutional layers, to try and make some more sense of what is going on:
fig,ax = plt.subplots(1,9)
with torch.no_grad():
p = next(net.conv.parameters())
for i,x in enumerate(p):
ax[i].imshow(x.detach().cpu()[0,...])
ax[i].axis('off')
You can see that some of those filters look like they can recognize some oblique strokes, while others look pretty random.
4) Multi-layered CNNs and pooling layers:
The first convolutional layer looks for primitive patterns, such as horizontal or vertical lines, but we can apply further convolutional layers on top of it to look for higher-level patterns, such as primitive shapes. Then more convolutional layers can combine those shapes into some parts of the picture, up to the final object that we are trying to classify.
When doing so, we may also apply one trick: reducing the spatial size of the image. Once we have detected that there is a horizontal stroke within a sliding 3x3 window, it is not so important at which exact pixel it occurred. Thus we can "scale down" the size of the image, which is done using one of the pooling layers:
Average Pooling takes a sliding window (for example, 2x2 pixels) and computes an average of values within the window
Max Pooling replaces the window with the maximum value. The idea behind max pooling is to detect a presence of a certain pattern within the sliding window.
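To see the effect on tensor shapes, here is a minimal sketch (shapes only): a 2x2 pooling layer halves each spatial dimension:
x = torch.rand(1,9,24,24)
print(nn.MaxPool2d(2)(x).shape)  # torch.Size([1, 9, 12, 12])
print(nn.AvgPool2d(2)(x).shape)  # torch.Size([1, 9, 12, 12])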
Thus, in a typical CNN there would be several convolutional layers, with pooling layers in between them to decrease dimensions of the image. We would also increase the number of filters, because as patterns become more advanced - there are more possible interesting combinations that we need to be looking for.
Because of decreasing spatial dimensions and increasing feature/filters dimensions, this architecture is also called pyramid architecture.
class MultiLayerCNN(nn.Module):
    def __init__(self):
        super(MultiLayerCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(10, 20, 5)
        self.fc = nn.Linear(320,10)

    def forward(self, x):
        x = self.pool(nn.functional.relu(self.conv1(x)))
        x = self.pool(nn.functional.relu(self.conv2(x)))
        x = x.view(-1, 320)
        x = nn.functional.log_softmax(self.fc(x),dim=1)
        return x
net = MultiLayerCNN()
summary(net,input_size=(1,1,28,28))
Note a few things about this definition:
Instead of using a Flatten layer, we flatten the tensor inside the forward function using the view function. Since the flattening layer has no trainable weights, it is not essential to create a separate layer instance within our class.
We use just one instance of the pooling layer in our model, also because it does not contain any trainable parameters, so this one instance can be safely reused.
The number of trainable parameters (~8.5k) is dramatically smaller than in previous cases. This happens because convolutional layers in general have few parameters, and the dimensionality of the image before applying the final dense layer is significantly reduced. A small number of parameters has a positive impact on our models, because it helps prevent overfitting even on smaller datasets.
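To see where the 320 inputs of the final linear layer come from, here is a quick sketch tracing shapes through the convolutional part of the network defined above (relu is omitted because it does not change shapes): each 5x5 convolution shrinks each spatial dimension by 4, and each 2x2 pooling halves it, so 28 -> 24 -> 12 -> 8 -> 4, and 20 channels x 4 x 4 = 320.
x = torch.rand(1,1,28,28)
x = net.pool(net.conv1(x))  # 28 -> 24 (5x5 conv) -> 12 (2x2 pool)
x = net.pool(net.conv2(x))  # 12 -> 8 (5x5 conv) -> 4 (2x2 pool)
print(x.shape)              # torch.Size([1, 20, 4, 4]); 20*4*4 = 320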
hist = train(net,train_loader,test_loader,epochs=5)
What you should observe is that we are able to achieve higher accuracy than with just one convolutional layer, and much faster, within just 1 or 2 epochs. It means that the more sophisticated network architecture needs much less data to figure out what is going on and to extract generic patterns from our images.
5) Playing with real images from the CIFAR-10 dataset:
While our handwritten digit recognition problem may seem like a toy problem, we are now ready to do something more serious. Let's explore a more advanced dataset of pictures of different objects, called CIFAR-10.
It contains 60k 32x32 color images, divided into 10 classes.
transform = torchvision.transforms.Compose(
    [torchvision.transforms.ToTensor(),
     torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=14, shuffle=True)
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=14, shuffle=False)
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
display_dataset(trainset,classes=classes)
A well-known architecture for CIFAR-10 is called LeNet, proposed by Yann LeCun. It follows the same principles as we have outlined above, the main difference being 3 input color channels instead of 1.
We also make one more simplification to this model: we do not use log_softmax as the output activation function, and just return the output of the last fully-connected layer. In this case we can use the CrossEntropyLoss function to optimize the model, because it expects raw logits and applies log_softmax internally.
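The two formulations are indeed equivalent, as this small sketch (not from the course) demonstrates:
logits = torch.randn(4,10)                # a hypothetical batch of raw network outputs
labels = torch.tensor([0,3,5,9])
ce = nn.CrossEntropyLoss()(logits, labels)
nll = nn.NLLLoss()(nn.functional.log_softmax(logits, dim=1), labels)
print(torch.allclose(ce, nll))            # True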
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.conv3 = nn.Conv2d(16,120,5)
        self.flat = nn.Flatten()
        self.fc1 = nn.Linear(120,64)
        self.fc2 = nn.Linear(64,10)

    def forward(self, x):
        x = self.pool(nn.functional.relu(self.conv1(x)))
        x = self.pool(nn.functional.relu(self.conv2(x)))
        x = nn.functional.relu(self.conv3(x))
        x = self.flat(x)
        x = nn.functional.relu(self.fc1(x))
        x = self.fc2(x)
        return x
net = LeNet()
summary(net,input_size=(1,3,32,32))
Training this network properly will take a significant amount of time, and should preferably be done on GPU-enabled compute.
opt = torch.optim.SGD(net.parameters(),lr=0.001,momentum=0.9)
hist = train(net, trainloader, testloader, epochs=3, optimizer=opt, loss_fn=nn.CrossEntropyLoss())
The accuracy that we have been able to achieve with 3 epochs of training does not seem great. However, remember that blind guessing would only give us 10% accuracy, and that our problem is actually significantly more difficult than MNIST digit classification. Getting above 50% accuracy in such a short training time seems like a good accomplishment.
6) Takeaways:
In this unit, we have learned the main concept behind computer vision neural networks - convolutional networks. Real-life architectures that power image classification, object detection, and even image generation networks are all based on CNNs, just with more layers and some additional training tricks.
V. USE A PRE-TRAINED NETWORK WITH TRANSFER LEARNING:
1) Pre-trained models and transfer learning:
Training CNNs can take a lot of time, and a lot of data is required for that task. However, much of that time is spent learning the best low-level filters that the network uses to extract patterns from images. A natural question arises: can we use a neural network trained on one dataset and adapt it to classify different images without a full training process?
This approach is called transfer learning, because we transfer some knowledge from one neural network model to another. In transfer learning, we typically start with a pre-trained model, which has been trained on some large image dataset, such as ImageNet. Those models can already do a good job extracting different features from generic images, and in many cases just building a classifier on top of those extracted features can yield a good result.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
from torchinfo import summary
import numpy as np
import os
from pytorchcv import train, plot_results, display_dataset, train_long, check_image_dir
2) Cats vs. Dogs Dataset:
In this unit, we will solve a real-life problem of classifying images of cats and dogs. For this purpose, we will use the Kaggle Cats vs. Dogs Dataset, which can also be downloaded from Microsoft.
Let's download this dataset and extract it into the data directory (this process may take some time!):
if not os.path.exists('data/kagglecatsanddogs_3367a.zip'):
    !wget -P data -q https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
import zipfile
if not os.path.exists('data/PetImages'):
    with zipfile.ZipFile('data/kagglecatsanddogs_3367a.zip', 'r') as zip_ref:
        zip_ref.extractall('data')
Unfortunately, there are some corrupt image files in the dataset, so we need to do a quick cleaning pass to check for corrupted files. In order not to clutter this tutorial, we moved the dataset verification code into a module.
check_image_dir('data/PetImages/Cat/*.jpg')
check_image_dir('data/PetImages/Dog/*.jpg')
Next, let's load the images into PyTorch dataset, converting them to tensors and doing some normalization. We will apply std_normalize transform to bring images to the range expected by pre-trained VGG network:
std_normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
trans = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        std_normalize])
dataset = torchvision.datasets.ImageFolder('data/PetImages',transform=trans)
trainset, testset = torch.utils.data.random_split(dataset,[20000,len(dataset)-20000])
display_dataset(dataset)
3) Pre-trained models:
There are many different pre-trained models available inside the torchvision module, and even more can be found on the Internet. Let's see how one of the simplest of them, the VGG-16 model, can be loaded and used:
vgg = torchvision.models.vgg16(pretrained=True)
sample_image = dataset[0][0].unsqueeze(0)
res = vgg(sample_image)
print(res[0].argmax())
The result that we have received is a number of an ImageNet class, which can be looked up here. We can use the following code to automatically load this class table and return the result:
import json, requests
class_map = json.loads(requests.get("https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json").text)
class_map = { int(k) : v for k,v in class_map.items() }
class_map[res[0].argmax().item()]
Let's also see the architecture of the VGG-16 network:
summary(vgg,input_size=(1,3,224,224))
In addition to the layers we already know, there is another layer type called Dropout. These layers act as a regularization technique. Regularization makes slight modifications to the learning algorithm so the model generalizes better. During training, dropout layers discard a certain proportion of the neurons in the previous layer (50% in VGG-16), and training happens without them. This helps to get the optimization process out of local minima, and to distribute decisive power between different neural paths, which improves the overall stability of the network.
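A minimal sketch (not from the course) of how a dropout layer behaves in training vs. evaluation mode:
drop = nn.Dropout(p=0.5)
x = torch.ones(8)
drop.train()     # training mode: random entries zeroed, the rest scaled by 1/(1-p)
print(drop(x))
drop.eval()      # evaluation mode: dropout is a no-op
print(drop(x))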
4) GPU computations:
Deep neural networks, such as VGG-16 and other more modern architectures, require quite a lot of computational power to run. It makes sense to use GPU acceleration if it is available. In order to do so, we need to explicitly move all tensors involved in the computation to the GPU.
The way it is normally done is to check the availability of GPU in the code, and define device variable that points to the computational device - either GPU or CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Doing computations on device = {}'.format(device))
vgg.to(device)
sample_image = sample_image.to(device)
vgg(sample_image).argmax()
5) Extracting VGG features:
If we want to use VGG-16 to extract features from our images, we need the model without the final classification layers. In fact, this "feature extractor" can be obtained using the vgg.features submodule:
res = vgg.features(sample_image).cpu()
plt.figure(figsize=(15,3))
plt.imshow(res.detach().view(-1,512))
print(res.size())
The dimension of feature tensor is 512x7x7, but in order to visualize it we had to reshape it to 2D form.
Now let's see if those features can be used to classify images. Let's manually take some portion of the images (800 in our case) and pre-compute their feature vectors. We will store the result in one big tensor called feature_tensor, and the labels in label_tensor:
bs = 8
dl = torch.utils.data.DataLoader(dataset,batch_size=bs,shuffle=True)
num = bs*100
feature_tensor = torch.zeros(num,512*7*7).to(device)
label_tensor = torch.zeros(num).to(device)
i = 0
for x,l in dl:
    with torch.no_grad():
        f = vgg.features(x.to(device))
    feature_tensor[i:i+bs] = f.view(bs,-1)
    label_tensor[i:i+bs] = l
    i += bs
    print('.',end='')
    if i >= num:
        break
Now we can define a vgg_dataset that takes data from this tensor, split it into training and test sets using the random_split function, and train a small one-layer dense classifier network on top of the extracted features:
vgg_dataset = torch.utils.data.TensorDataset(feature_tensor,label_tensor.to(torch.long))
train_ds, test_ds = torch.utils.data.random_split(vgg_dataset,[700,100])
train_loader = torch.utils.data.DataLoader(train_ds,batch_size=32)
test_loader = torch.utils.data.DataLoader(test_ds,batch_size=32)
net = torch.nn.Sequential(torch.nn.Linear(512*7*7,2),torch.nn.LogSoftmax(dim=1)).to(device)
history = train(net,train_loader,test_loader)
The result is great: we can distinguish between a cat and a dog with almost 98% accuracy! However, we have only tested this approach on a small subset of all images, because manual feature extraction takes a lot of time.
6) Transfer learning using the whole VGG network:
We can also avoid manually pre-computing the features by using the original VGG-16 network as a whole during training. Let's look at the VGG-16 object structure:
print(vgg)
You can see that the network contains:
feature extractor (features), comprised of a number of convolutional and pooling layers
average pooling layer (avgpool)
final classifier, consisting of several dense layers, which turns 25088 input features into 1000 classes (which is the number of classes in ImageNet)
To train the end-to-end model that will classify our dataset, we need to:
replace the final classifier with one that produces the required number of classes. In our case, we can use one Linear layer with 25088 inputs and 2 output neurons.
freeze the weights of the convolutional feature extractor, so that they are not trained. It is recommended to do this freezing initially, because otherwise the untrained classifier layer can destroy the original pre-trained weights of the convolutional extractor. Freezing weights can be accomplished by setting the requires_grad property of all parameters to False
vgg.classifier = torch.nn.Linear(25088,2).to(device)
for x in vgg.features.parameters():
    x.requires_grad = False
summary(vgg,(1,3,224,224)) # 224x224 is the input size produced by our transforms
As you can see from the summary, this model contains around 15 million parameters in total, but only 50k of them are trainable: the weights of the classification layer. That is good, because we are able to fine-tune a smaller number of parameters with a smaller number of examples.
Now let's train the model using our original dataset. This process will take a long time, so we will use train_long function that will print some intermediate results without waiting for the end of epoch. It is highly recommended to run this training on GPU-enabled compute!
trainset, testset = torch.utils.data.random_split(dataset,[20000,len(dataset)-20000])
train_loader = torch.utils.data.DataLoader(trainset,batch_size=16)
test_loader = torch.utils.data.DataLoader(testset,batch_size=16)
train_long(vgg,train_loader,test_loader,loss_fn=torch.nn.CrossEntropyLoss(),epochs=1,print_freq=90)
It looks like we have obtained reasonably accurate cats vs. dogs classifier! Let's save it for future use!
torch.save(vgg,'data/cats_dogs.pth')
We can then load the model from the file at any time. You may find this useful in case the next experiment destroys the model; you would not have to restart from scratch.
vgg = torch.load('data/cats_dogs.pth')
7) Fine-tuning transfer learning:
In the previous section, we trained the final classifier layer to classify images in our own dataset. However, we did not re-train the feature extractor, so our model relied on the features the model had learned on ImageNet data. If your objects visually differ from ordinary ImageNet images, this combination of features might not work well. Thus it makes sense to train the convolutional layers as well.
To do that, we can unfreeze the convolutional filter parameters that we have previously frozen.
Note: It is important that you freeze parameters first and perform several epochs of training in order to stabilize weights in the classification layer. If you immediately start training end-to-end network with unfrozen parameters, large errors are likely to destroy the pre-trained weights in the convolutional layers.
for x in vgg.features.parameters():
    x.requires_grad = True
After unfreezing, we can do a few more epochs of training. You can also select a lower learning rate, in order to minimize the impact on the pre-trained weights. However, even with a low learning rate, you can expect the accuracy to drop at the beginning of training, until it finally reaches a slightly higher level than in the case of fixed weights.
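One way to express this (a sketch, not from the course) is to build the optimizer with parameter groups, giving the pre-trained convolutional layers an even lower learning rate than the freshly initialized classifier; the rates below are illustrative. The resulting optimizer can then be passed to a training function that accepts an optimizer argument, as we did with SGD for LeNet earlier:
opt = torch.optim.Adam([
    {'params': vgg.features.parameters(), 'lr': 1e-5},    # gentle updates for pre-trained filters
    {'params': vgg.classifier.parameters(), 'lr': 1e-4},  # larger steps for the new classifier
])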
Note: This training happens much slower, because we need to propagate gradients back through many layers of the network! You may want to watch the first few minibatches to see the tendency, and then stop the computation.
train_long(vgg,train_loader,test_loader,loss_fn=torch.nn.CrossEntropyLoss(),epochs=1,print_freq=90,lr=0.0001)
8) Other computer vision models:
VGG-16 is one of the simplest computer vision architectures. The torchvision package provides many more pre-trained networks. The most frequently used among them are the ResNet architectures, developed by Microsoft, and Inception, developed by Google. For example, let's explore the architecture of the simplest ResNet-18 model (ResNet is a family of models with different depths; you can try experimenting with ResNet-152 if you want to see what a really deep model looks like):
resnet = torchvision.models.resnet18()
print(resnet)
As you can see, the model contains the same building blocks: feature extractor and final classifier (fc). This allows us to use this model in exactly the same manner as we have been using VGG-16 for transfer learning. You can try experimenting with the code above, using different ResNet models as the base model, and see how accuracy changes.
9) Batch Normalization:
This network contains yet another type of layer: Batch Normalization.
The idea of batch normalization is to bring the values that flow through the neural network to the right interval.
Usually neural networks work best when all values are in the range [-1,1] or [0,1], and that is the reason we scale/normalize our input data accordingly. However, during training of a deep network, it can happen that values get significantly outside this range, which makes training problematic. A batch normalization layer computes the average and standard deviation of all values in the current minibatch, and uses them to normalize the signal before passing it through the next layer. This significantly improves the stability of deep networks.
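A minimal sketch (not from the course) of what a batch normalization layer does in training mode, normalizing each channel using the statistics of the current minibatch:
bn = nn.BatchNorm2d(3)
x = torch.rand(16,3,8,8) * 100          # activations far outside the comfortable range
y = bn(x)
print(y.mean().item(), y.std().item())  # approximately 0 and 1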
10) Takeaway:
Using transfer learning, we were able to quickly put together a classifier for our custom object classification task, and achieve high accuracy. However, this example was not completely fair, because original VGG-16 network was pre-trained to recognize cats and dogs, and thus we were just reusing most of the patterns that were already present in the network. You can expect lower accuracy on more exotic domain-specific objects, such as details on production line in a plant, or different tree leaves.
You can see that the more complex tasks we are solving now require higher computational power and cannot be easily solved on a CPU. In the next unit, we will try a more lightweight model that can be trained with lower compute resources, at the cost of only slightly lower accuracy.
VI. SOLVING VISION PROBLEMS WITH MOBILENET:
1) Lightweight networks and MobileNet:
We have seen that complex networks require significant computational resources, such as a GPU, for training and for fast inference. However, it turns out that a model with a significantly smaller number of parameters can in most cases still be trained to perform reasonably well. In other words, an increase in model complexity typically results in a small (non-proportional) increase in model performance.
We observed this at the beginning of the module when training MNIST digit classification: the accuracy of the simple dense model was not significantly worse than that of a powerful CNN. Increasing the number of CNN layers and/or the number of neurons in the classifier allowed us to gain a few percent of accuracy at most.
This leads us to the idea that we can experiment with Lightweight network architectures in order to train faster models. This is especially important if we want to be able to execute our models on mobile devices.
This module will rely on the Cats and Dogs dataset that we have downloaded in the previous unit. First we will make sure that the dataset is available.
import torch
import torch.nn as nn
import torchvision
import matplotlib.pyplot as plt
from torchinfo import summary
import os
from pytorchcv import train, display_dataset, train_long, load_cats_dogs_dataset, validate, common_transform
if not os.path.exists('data/kagglecatsanddogs_3367a.zip'):
    !wget -P data -q https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
dataset, train_loader, test_loader = load_cats_dogs_dataset()
2) MobileNet:
In the previous unit, we saw the ResNet architecture for image classification. A more lightweight analog of ResNet is MobileNet, which uses so-called Inverted Residual Blocks. Let's load a pre-trained MobileNet and see how it works:
model = torch.hub.load('pytorch/vision:v0.6.0', 'mobilenet_v2', pretrained=True)
model.eval()
print(model)
Let's apply the model to our dataset and make sure that it works.
sample_image = dataset[0][0].unsqueeze(0)
res = model(sample_image)
print(res[0].argmax())
3) Using MobileNet for transfer learning:
Now let's perform the same transfer learning process as in previous unit, but using MobileNet.
First of all, let's freeze all parameters of the model:
for x in model.parameters():
    x.requires_grad = False
Then, replace the final classifier. We also transfer the model to our default training device (GPU or CPU):
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.classifier = nn.Linear(1280,2)
model = model.to(device)
summary(model,input_size=(1,3,224,224))
Now let's do the actual training:
train_long(model,train_loader,test_loader,loss_fn=torch.nn.CrossEntropyLoss(),epochs=1,print_freq=90)
4) Takeaway:
Notice that MobileNet achieves almost the same accuracy as VGG-16, only slightly lower than full-scale ResNet.
The main advantage of small models, such as MobileNet or ResNet-18, is that they can be used on mobile devices.