[T2T-ViT] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

{Aggregating neighboring tokens; deep-narrow backbone structure}
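
To make "aggregating neighboring tokens" concrete, here is a minimal pure-Python sketch of one overlapping "soft split" step: the tokens are treated as a 2D grid, and each sliding window of neighbors is concatenated into one longer token, so the token count shrinks. The function name `soft_split`, the default window settings (kernel 3, stride 2, padding 1), and the ordering of values inside the merged token are illustrative assumptions, not the paper's reference implementation.

```python
def soft_split(tokens, height, width, kernel=3, stride=2, padding=1):
    """Merge each kernel x kernel neighborhood of tokens into one longer token.

    tokens: list of height*width tokens in row-major order, each a list of floats.
    Returns (new_tokens, out_height, out_width).
    Hypothetical helper for illustration; zero padding is assumed at the borders.
    """
    channels = len(tokens[0])
    zero = [0.0] * channels
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    new_tokens = []
    for oy in range(out_h):
        for ox in range(out_w):
            merged = []
            # Walk the window and concatenate every neighbor's channels.
            for ky in range(kernel):
                for kx in range(kernel):
                    y = oy * stride - padding + ky
                    x = ox * stride - padding + kx
                    inside = 0 <= y < height and 0 <= x < width
                    merged.extend(tokens[y * width + x] if inside else zero)
            new_tokens.append(merged)
    return new_tokens, out_h, out_w


# A 4x4 grid of 2-channel tokens; token i holds [i, -i].
grid = [[float(i), float(-i)] for i in range(16)]
merged_tokens, new_h, new_w = soft_split(grid, height=4, width=4)
```

With these settings the 16 input tokens become 4 tokens, each 9 times as long as an input token, because the stride is smaller than the kernel and neighboring windows overlap.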

Paper: https://arxiv.org/pdf/2101.11986.pdf