Attention is All You Need

Transformers are generic, simple, and exciting architectures designed to process a connected set of units (tokens in a sequence, pixels in an image, etc.), where the only interaction between units is through self-attention: an attention mechanism that relates different positions of a single sequence in order to compute a representation of that same sequence.
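To make this concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The dimensions and projection matrices (`W_q`, `W_k`, `W_v`) are illustrative assumptions, not the exact configuration from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token embeddings; returns (seq_len, d_v)."""
    Q = X @ W_q                         # queries (seq_len, d_k)
    K = X @ W_k                         # keys    (seq_len, d_k)
    V = X @ W_v                         # values  (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # every position scored against every position
    weights = softmax(scores, axis=-1)  # (seq_len, seq_len) attention matrix
    return weights @ V                  # each output is a weighted sum of all values

# Toy usage with assumed sizes
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8)
```

Note that the attention matrix is `(seq_len, seq_len)`: every unit interacts with every other unit in a single layer.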

CNNs only capture dependencies within the local window around each pixel, so a single layer sees only local context, whereas Transformers capture global context because self-attention lets every position interact directly with every other position in one layer; the sketch below makes the contrast concrete.
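A small back-of-the-envelope sketch of that contrast, assuming stride-1, undilated convolutions (the kernel size is an illustrative choice): stacking convolutions grows the receptive field by `kernel_size - 1` per layer, while one self-attention layer already relates all positions.

```python
import math

def conv_layers_to_cover(seq_len: int, kernel_size: int = 3) -> int:
    """Number of stacked stride-1, undilated conv layers needed before one
    output position's receptive field spans the whole sequence."""
    return math.ceil((seq_len - 1) / (kernel_size - 1))

print(conv_layers_to_cover(512))  # 256 layers of kernel-size-3 convolutions
# A single self-attention layer connects all 512 positions in one step.
```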

[Paper] [Code]