Scaling Local Self-Attention for Parameter Efficient Visual Backbones
{Hybrid Transformer, Local Feature, Self-Training, Self-Attention}
0) Motivation, Objectives and Related Works:
Motivation:
Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions.
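To make the scaling contrast concrete, here is a minimal Python sketch (illustrative only; the function names and the channel count of 64 are assumptions, not from the paper) comparing how parameter counts grow with receptive field size for a convolution versus the query/key/value projections of a self-attention layer:

```python
def conv_params(kernel_size: int, channels: int) -> int:
    # A k x k convolution stores k * k * C_in * C_out weights, so enlarging
    # the receptive field (bigger k) costs quadratically more parameters.
    return kernel_size * kernel_size * channels * channels

def attention_params(window_size: int, channels: int) -> int:
    # Self-attention uses fixed C x C query/key/value projections; the
    # attended window (its receptive field) can grow without adding weights.
    del window_size  # the receptive field never enters the parameter count
    return 3 * channels * channels

for size in (3, 7, 15):
    print(f"field {size:2d}: conv={conv_params(size, 64):7d}  "
          f"attn={attention_params(size, 64):7d}")
```

The convolution's cost grows quadratically with kernel size while the attention projections stay fixed, which is the parameter-independent receptive-field scaling the motivation refers to (relative positional embeddings, omitted here, add only a small window-dependent term).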
Self-attention models have recently been shown to achieve encouraging improvements in accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50.
Objectives:
Develop self-attention models that can outperform not just the canonical baseline models, but even the high-performing convolutional models.
We propose two extensions to self-attention that, in conjunction with a more efficient implementation of self-attention, improve the speed, memory usage, and accuracy of these models.
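In the paper, the two extensions are blocked local self-attention with haloing and attention-based downsampling. Below is a minimal NumPy sketch of the haloing idea only, under simplifying assumptions (single head, zero-padded borders, no relative positional embeddings); the function `halo_attention` and its weight shapes are illustrative, not the paper's implementation. Queries come from non-overlapping blocks, while keys and values come from each block plus a surrounding halo of pixels:

```python
import numpy as np

def halo_attention(x, wq, wk, wv, block=4, halo=1):
    """Blocked local self-attention with haloing (single-head sketch).

    Queries are computed per non-overlapping block x block region; keys
    and values come from the same region padded by `halo` pixels per
    side, so each block also attends to a ring of neighboring context.
    """
    H, W, C = x.shape
    assert H % block == 0 and W % block == 0
    q = x @ wq                                              # (H, W, C)
    padded = np.pad(x, ((halo, halo), (halo, halo), (0, 0)))
    k, v = padded @ wk, padded @ wv                         # haloed keys/values
    out = np.empty_like(q)
    win = block + 2 * halo                                  # haloed window size
    for i in range(0, H, block):
        for j in range(0, W, block):
            qb = q[i:i + block, j:j + block].reshape(-1, C)  # (block^2, C)
            kb = k[i:i + win, j:j + win].reshape(-1, C)      # (win^2, C)
            vb = v[i:i + win, j:j + win].reshape(-1, C)
            logits = qb @ kb.T / np.sqrt(C)
            attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
            attn /= attn.sum(axis=-1, keepdims=True)         # row-wise softmax
            out[i:i + block, j:j + block] = (attn @ vb).reshape(block, block, C)
    return out

rng = np.random.default_rng(0)
C = 16
x = rng.standard_normal((8, 8, C))
wq, wk, wv = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
print(halo_attention(x, wq, wk, wv, block=4, halo=1).shape)  # (8, 8, 16)
```

With an 8x8x16 feature map, block=4, and halo=1, each 4x4 query block attends over a 6x6 haloed neighborhood; restricting attention to these windows is what makes the computation and memory tractable compared to global self-attention.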
We leverage these improvements to develop a new self-attention model family, HaloNets, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark.
In preliminary transfer-learning experiments, HaloNet models outperform much larger models and have better inference performance.
On harder tasks such as object detection and instance segmentation, our simple local self-attention and convolutional hybrids show improvements over very strong baselines.
These results mark another step in demonstrating the efficacy of self-attention models on settings traditionally dominated by convolutional models.