[LG-Transformer] Local-to-Global Self-Attention in Vision Transformers
{SW-MSA; Local-Global-Attention}
0) Motivation, Objectives and Related works:
Motivation:
Transformers have demonstrated great potential in computer vision tasks.
To avoid the dense computation of global self-attention on high-resolution visual data, some recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows.
This design significantly improves efficiency but lacks global feature reasoning in the early stages.
Objectives:
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
The proposed framework is computationally efficient and highly effective.
With a marginal increase in computational overhead, our model achieves notable improvements in both image classification and semantic segmentation.
Related works:
Multi-head self-attention with shifted window partitioning (SW-MSA), as used in Swin Transformer.
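The sketch below illustrates the window partitioning with cyclic shift that underlies SW-MSA; it is a minimal sketch, and the helper names, tensor shapes, and shift size are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of (shifted) window partitioning for window-based self-attention.
# Shapes and function names are assumptions for illustration, not the paper's code.
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)

def shifted_windows(x, window_size, shift_size):
    """Cyclically shift the feature map before partitioning, so that windows in
    this block straddle the borders of the previous block's windows."""
    if shift_size > 0:
        x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    return window_partition(x, window_size)

# Example: a 56x56 feature map with 7x7 windows and a shift of window_size // 2.
feat = torch.randn(1, 56, 56, 96)
regular = window_partition(feat, window_size=7)                # W-MSA windows
shifted = shifted_windows(feat, window_size=7, shift_size=3)   # SW-MSA windows
print(regular.shape, shifted.shape)  # torch.Size([64, 7, 7, 96]) for both
```

In the shifted configuration, each window mixes pixels that belonged to different windows in the preceding block, which lets information propagate across window boundaries without computing global attention.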