[FAVOR+] Rethinking Attention with Performers
0) Motivation, Objectives and Related works:
Motivation:
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness.
Objectives:
To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
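To make the random-feature idea behind FAVOR+ concrete, here is a minimal NumPy sketch of positive orthogonal random features for the softmax kernel exp(x^T y). The function name, the block-wise QR orthogonalization, and the dimensions used in the check are illustrative choices for this sketch, not code from the paper.

```python
import numpy as np

def positive_orthogonal_features(X, m, rng):
    """Map rows of X (n x d) to m-dimensional positive random features such that
    phi(x) @ phi(y) approximates exp(x @ y) in expectation (the softmax kernel).
    A minimal sketch of a FAVOR+-style feature map; names are illustrative."""
    n, d = X.shape
    # Draw Gaussian directions and orthogonalize them block by block (the "O" in FAVOR+).
    blocks, remaining = [], m
    while remaining > 0:
        G = rng.standard_normal((d, d))
        Q, _ = np.linalg.qr(G)                       # square orthogonal matrix: orthonormal rows
        # Rescale rows so their norms match those of i.i.d. Gaussian vectors.
        norms = np.linalg.norm(rng.standard_normal((d, d)), axis=1)
        blocks.append(Q * norms[:, None])
        remaining -= d
    W = np.concatenate(blocks, axis=0)[:m]           # (m, d) projection matrix
    # Positive features (the "+"): exp(w^T x - ||x||^2 / 2) / sqrt(m), always > 0.
    sq_norms = 0.5 * np.sum(X**2, axis=1, keepdims=True)
    return np.exp(X @ W.T - sq_norms) / np.sqrt(m)

# Quick unbiasedness check against the softmax kernel exp(q @ k):
rng = np.random.default_rng(0)
q, k = 0.1 * rng.standard_normal(16), 0.1 * rng.standard_normal(16)
phi = positive_orthogonal_features(np.stack([q, k]), m=256, rng=rng)
print(phi[0] @ phi[1], np.exp(q @ k))                # the two values should be close
```

Because the features are strictly positive, the estimated attention weights cannot become negative, which is what gives the low-variance, stable estimation of the softmax attention matrix mentioned above.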
Related works:
FAVOR+, standing for Fast Attention Via positive Orthogonal Random features, was proposed in [7] as the key module of a Transformer architecture named Performer. FAVOR+ approximates regular attention with linear time and space complexity. The OR+ part of FAVOR+ is achieved by projecting the queries and keys onto a positive orthogonal random-feature space. The FA part of FAVOR+ is achieved by changing the order of computation in the attention module, as shown in Fig. 6. In a regular attention module, the query and key matrices are first multiplied (which requires quadratic time and space complexity), and the result is then multiplied with the value matrix. FAVOR+, on the other hand, approximates regular attention by first multiplying the key matrix with the value matrix, followed by left-multiplication with the query matrix. This results in linear time and space complexity, as shown in Fig. 6.
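The reordering can be sketched in a few lines of NumPy. The sketch below uses plain i.i.d. positive random features for brevity (FAVOR+ additionally orthogonalizes the projection matrix, as above), and the sequence length, head dimension and feature count are illustrative; the point is that no L x L attention matrix is ever formed.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d, m = 128, 16, 512                       # sequence length, head dim, number of features
Q = 0.1 * rng.standard_normal((L, d))
K = 0.1 * rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# Positive random features (i.i.d. here for brevity; FAVOR+ also orthogonalizes W).
W = rng.standard_normal((m, d))
phi = lambda X: np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=1, keepdims=True)) / np.sqrt(m)

# Reordered (linear) attention: multiply keys with values first, then left-multiply by queries.
Qp, Kp = phi(Q), phi(K)                      # (L, m) each
KV = Kp.T @ V                                # (m, d): O(L*m*d), no L x L matrix is formed
numer = Qp @ KV                              # (L, d)
denom = Qp @ Kp.sum(axis=0)                  # (L,): row normalizer replacing the softmax sum
approx = numer / denom[:, None]

# Regular quadratic attention for comparison (sqrt(d) scaling omitted).
A = np.exp(Q @ K.T)
exact = (A / A.sum(axis=1, keepdims=True)) @ V
print(np.max(np.abs(approx - exact)))        # small approximation error
```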
The Performer architecture, which uses FAVOR+ internally, was also explored in T2T-ViT [6] and was found to be competitive in performance with the original Transformer while reducing the computation cost.
References:
https://sertiscorp.medium.com/vision-transformers-a-review-part-ii-a31136cf848d