[LG-Transformer] Local-to-Global Self-Attention in Vision Transformers
{SW-MSA; Local-Global-Attention}
0) Motivation, Objectives and Related works:
Motivation:
Transformers have demonstrated great potential in computer vision tasks.
To avoid the dense computation of global self-attention on high-resolution visual data, some recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows.
This design significantly improves efficiency but lacks global feature reasoning in the early stages.
Objectives:
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
The proposed framework is computationally efficient and highly effective.
With a marginal increase in computational overhead, our model achieves notable improvements in both image classification and semantic segmentation.
Related works:
Multi-head self-attention with shifted window partitioning (SW-MSA), as used in Swin Transformer.
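The sketch below illustrates the window partitioning with cyclic shift that underlies SW-MSA; it is a minimal sketch, and the helper names, tensor shapes, and shift size are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of (shifted) window partitioning for window-based self-attention.
# Shapes and function names are assumptions for illustration, not the paper's code.
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)

def shifted_windows(x, window_size, shift_size):
    """Cyclically shift the feature map before partitioning, so that windows in
    this block straddle the borders of the previous block's windows."""
    if shift_size > 0:
        x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    return window_partition(x, window_size)

# Example: a 56x56 feature map with 7x7 windows and a shift of window_size // 2.
feat = torch.randn(1, 56, 56, 96)
regular = window_partition(feat, window_size=7)                # W-MSA windows
shifted = shifted_windows(feat, window_size=7, shift_size=3)   # SW-MSA windows
print(regular.shape, shifted.shape)  # torch.Size([64, 7, 7, 96]) for both
```

In the shifted configuration, each window mixes pixels that belonged to different windows in the preceding block, which lets information propagate across window boundaries without computing global attention.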