Scaling Local Self-Attention for Parameter Efficient Visual Backbones

{Hybrid Transformer, Local Feature, Self-Training, Self-Attention}
