Deep Clustering for Unsupervised Learning of Visual Features
DeepCluster combines two pieces: unsupervised clustering and deep neural networks. It proposes an end-to-end method to jointly learn the parameters of a deep neural network and the cluster assignments of its representations. Features are generated and clustered iteratively, yielding both a trained model and cluster labels as output artifacts.
At a high level, the method works as follows:
1. Unlabeled images are taken and augmentations are applied to them.
2. A ConvNet architecture such as AlexNet or VGG-16 is used as the feature extractor. Initially, the ConvNet is initialized with random weights, and the feature vector is taken from the layer before the final classification head.
3. PCA is used to reduce the dimension of the feature vector, along with whitening and L2 normalization.
4. The processed features are passed to K-means to get a cluster assignment for each image.
5. These cluster assignments are used as pseudo-labels, and the ConvNet is trained to predict them.
6. Cross-entropy loss is used to gauge the performance of the model. The model is trained for 100 epochs, with the clustering step occurring once per epoch (steps 3 to 6 are sketched in code below).
7. The learned representations can then be used for downstream tasks.
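To make steps 3 to 6 concrete, here is a minimal sketch of one clustering round followed by pseudo-label training. It is an illustrative sketch under assumptions, not the authors' code: it uses scikit-learn's PCA and KMeans where the paper uses the faiss library (with k = 10,000 clusters and 256 PCA dimensions for ImageNet), and the names cluster_features, train_one_epoch, model, head, and loader are hypothetical.

import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_features(features, n_clusters=10000, pca_dim=256):
    # Step 3: PCA with whitening reduces dimension and decorrelates features;
    # `features` is an (N, D) array of ConvNet features, one row per image
    reduced = PCA(n_components=pca_dim, whiten=True).fit_transform(features)
    # L2-normalize each feature vector
    reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)
    # Step 4: K-means assigns a pseudo-label to every image
    return KMeans(n_clusters=n_clusters).fit_predict(reduced)

def train_one_epoch(model, head, loader, pseudo_labels, optimizer):
    # Steps 5-6: treat the cluster assignments as classification targets
    criterion = nn.CrossEntropyLoss()
    for images, indices in loader:  # assumes loader yields images plus their dataset indices
        targets = torch.as_tensor(pseudo_labels[indices.numpy()], dtype=torch.long)
        logits = head(model(images))  # feature extractor followed by classification head
        loss = criterion(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Since cluster ids are arbitrary and change after every re-clustering, the classification head is reinitialized each time the pseudo-labels are regenerated.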
The rest of the post walks through each piece of the pipeline:
1. Training Data
2. Data Augmentation
   a. Transformation when doing clustering
   b. Transformation when training the model
3. Deciding the Number of Clusters
4. Model Architecture
5. Generating the Initial Labels
6. Clustering
7. Representation Learning
8. Switching between Model Training and Clustering
When model representations are sent for clustering, random augmentations are not used, so that each image produces a stable feature vector and hence a stable cluster assignment. The image is simply resized to 256×256, a 224×224 center crop is taken, and normalization with the ImageNet mean and std is applied.
from PIL import Image
import torchvision.transforms as transforms

# Ensure 3 channels even for grayscale or RGBA files
im = Image.open('dog.png').convert('RGB')

# Deterministic pipeline: resize, center crop, normalize with ImageNet stats
t = transforms.Compose([transforms.Resize(256),
                        transforms.CenterCrop(224),
                        transforms.ToTensor(),
                        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                             std=[0.229, 0.224, 0.225])])
aug_im = t(im)
When the model is trained on the images and their pseudo-labels, random augmentations are used. The image is cropped to a random size and aspect ratio and resized to 224×224. Then, the image is horizontally flipped with a 50% chance. Finally, the image is normalized with the ImageNet mean and std.
from PIL import Image
import torchvision.transforms as transforms

im = Image.open('dog.png').convert('RGB')

# Random augmentations: random crop and resize, random horizontal flip, normalization
t = transforms.Compose([transforms.RandomResizedCrop(224),
                        transforms.RandomHorizontalFlip(),
                        transforms.ToTensor(),
                        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                             std=[0.229, 0.224, 0.225])])
aug_im = t(im)
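For the model architecture, DeepCluster does not feed raw RGB images to the ConvNet. Color information is removed and a fixed Sobel filter is applied to increase edge saliency, which discourages the clustering from latching onto trivial color cues. The snippet below implements this grayscale-plus-Sobel preprocessing as two convolutional layers with hand-set weights.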
import torch
import torch.nn as nn

# 1x1 convolution that averages the RGB channels into a single grayscale channel
grayscale = nn.Conv2d(3, 1, kernel_size=1, stride=1, padding=0)
grayscale.weight.data.fill_(1.0 / 3.0)
grayscale.bias.data.zero_()

# 3x3 convolution holding the two Sobel kernels (x and y image gradients)
sobel = nn.Conv2d(1, 2, kernel_size=3, stride=1, padding=1)
sobel.weight.data[0, 0].copy_(
    torch.FloatTensor([[1, 0, -1],
                       [2, 0, -2],
                       [1, 0, -1]])
)
sobel.weight.data[1, 0].copy_(
    torch.FloatTensor([[1, 2, 1],
                       [0, 0, 0],
                       [-1, -2, -1]])
)
sobel.bias.data.zero_()

# Combine the two into a fixed preprocessing module; these filters are not
# learned, so gradients are disabled for them
combined = nn.Sequential(grayscale, sobel)
for p in combined.parameters():
    p.requires_grad = False

# Apply to a single image (add a batch dimension first)
batch_image = aug_im.unsqueeze(dim=0)
sobel_im = combined(batch_image)
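For a 224×224 input, sobel_im has shape (1, 2, 224, 224): one channel per Sobel direction (x and y gradients). In DeepCluster, these two gradient channels, rather than the three RGB channels, form the input to the ConvNet.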