Deep Clustering for Unsupervised Learning of Visual Features
DeepCluster combines two pieces: unsupervised clustering and deep neural networks. It proposes an end-to-end method to jointly learn the parameters of a deep neural network and the cluster assignments of its representations. Features are generated and clustered iteratively, yielding both a trained model and cluster labels as output artifacts.
At a high level, the method works as follows:
1. Unlabeled images are taken and augmentations are applied to them.
2. A ConvNet architecture such as AlexNet or VGG-16 is used as the feature extractor. Initially, the ConvNet is initialized with random weights, and the feature vector is taken from the layer before the final classification head.
3. PCA is used to reduce the dimension of the feature vector, along with whitening and L2 normalization.
4. The processed features are passed to K-means to get a cluster assignment for each image.
5. These cluster assignments are used as pseudo-labels, and the ConvNet is trained to predict them.
6. Cross-entropy loss is used to gauge the performance of the model. The model is trained for 100 epochs, with the clustering step occurring once per epoch (steps 3 to 6 are sketched in code below).
7. The learned representations can then be used for downstream tasks.
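To make steps 3 to 6 concrete, here is a minimal sketch of one clustering round followed by pseudo-label training. It is an illustrative sketch under assumptions, not the authors' code: it uses scikit-learn's PCA and KMeans where the paper uses the faiss library (with k = 10,000 clusters and 256 PCA dimensions for ImageNet), and the names cluster_features, train_one_epoch, model, head, and loader are hypothetical.

import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_features(features, n_clusters=10000, pca_dim=256):
    # Step 3: PCA with whitening reduces dimension and decorrelates features;
    # `features` is an (N, D) array of ConvNet features, one row per image
    reduced = PCA(n_components=pca_dim, whiten=True).fit_transform(features)
    # L2-normalize each feature vector
    reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)
    # Step 4: K-means assigns a pseudo-label to every image
    return KMeans(n_clusters=n_clusters).fit_predict(reduced)

def train_one_epoch(model, head, loader, pseudo_labels, optimizer):
    # Steps 5-6: treat the cluster assignments as classification targets
    criterion = nn.CrossEntropyLoss()
    for images, indices in loader:  # assumes loader yields images plus their dataset indices
        targets = torch.as_tensor(pseudo_labels[indices.numpy()], dtype=torch.long)
        logits = head(model(images))  # feature extractor followed by classification head
        loss = criterion(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Since cluster ids are arbitrary and change after every re-clustering, the classification head is reinitialized each time the pseudo-labels are regenerated.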
The rest of the post walks through each piece of the pipeline:
1. Training Data
2. Data Augmentation
   a. Transformation when doing clustering
   b. Transformation when training the model
3. Deciding the Number of Clusters
4. Model Architecture
5. Generating the Initial Labels
6. Clustering
7. Representation Learning
8. Switching between Model Training and Clustering
When model representations are sent for clustering, random augmentations are not used, so that each image produces a stable feature vector and hence a stable cluster assignment. The image is simply resized to 256×256, a 224×224 center crop is taken, and normalization with the ImageNet mean and std is applied.
from PIL import Image
import torchvision.transforms as transforms

# Ensure 3 channels even for grayscale or RGBA files
im = Image.open('dog.png').convert('RGB')

# Deterministic pipeline: resize, center crop, normalize with ImageNet stats
t = transforms.Compose([transforms.Resize(256),
                        transforms.CenterCrop(224),
                        transforms.ToTensor(),
                        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                             std=[0.229, 0.224, 0.225])])
aug_im = t(im)
When the model is trained on the images and their pseudo-labels, random augmentations are used. The image is cropped to a random size and aspect ratio and resized to 224×224. Then, the image is horizontally flipped with a 50% chance. Finally, the image is normalized with the ImageNet mean and std.
from PIL import Image
import torchvision.transforms as transforms

im = Image.open('dog.png').convert('RGB')

# Random augmentations: random crop and resize, random horizontal flip, normalization
t = transforms.Compose([transforms.RandomResizedCrop(224),
                        transforms.RandomHorizontalFlip(),
                        transforms.ToTensor(),
                        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                             std=[0.229, 0.224, 0.225])])
aug_im = t(im)
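For the model architecture, DeepCluster does not feed raw RGB images to the ConvNet. Color information is removed and a fixed Sobel filter is applied to increase edge saliency, which discourages the clustering from latching onto trivial color cues. The snippet below implements this grayscale-plus-Sobel preprocessing as two convolutional layers with hand-set weights.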
import torch
import torch.nn as nn

# 1x1 convolution that averages the RGB channels into a single grayscale channel
grayscale = nn.Conv2d(3, 1, kernel_size=1, stride=1, padding=0)
grayscale.weight.data.fill_(1.0 / 3.0)
grayscale.bias.data.zero_()

# 3x3 convolution holding the two Sobel kernels (x and y image gradients)
sobel = nn.Conv2d(1, 2, kernel_size=3, stride=1, padding=1)
sobel.weight.data[0, 0].copy_(
    torch.FloatTensor([[1, 0, -1],
                       [2, 0, -2],
                       [1, 0, -1]])
)
sobel.weight.data[1, 0].copy_(
    torch.FloatTensor([[1, 2, 1],
                       [0, 0, 0],
                       [-1, -2, -1]])
)
sobel.bias.data.zero_()

# Combine the two into a fixed preprocessing module; these filters are not
# learned, so gradients are disabled for them
combined = nn.Sequential(grayscale, sobel)
for p in combined.parameters():
    p.requires_grad = False

# Apply to a single image (add a batch dimension first)
batch_image = aug_im.unsqueeze(dim=0)
sobel_im = combined(batch_image)
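For a 224×224 input, sobel_im has shape (1, 2, 224, 224): one channel per Sobel direction (x and y gradients). In DeepCluster, these two gradient channels, rather than the three RGB channels, form the input to the ConvNet.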