FNet: Mixing Tokens with Fourier Transforms
This paper replaces the self-attention layer with two Fourier transforms that mix tokens along the hidden and sequence dimensions. This is about 7X more efficient than self-attention. The trade-off is a small drop in quality: FNet reaches about 92% of BERT's accuracy on the GLUE benchmark.
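
As a minimal sketch of the mixing step (assuming PyTorch; `fourier_mix` is a name chosen here, not from the paper): apply a DFT along the hidden dimension, then along the sequence dimension, and keep only the real part.

```python
import torch

def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, d_model)
    # DFT along the hidden dimension, then along the sequence
    # dimension; keeping only the real part gives the token mixing
    # y = Real(FFT_seq(FFT_hidden(x))) described in the paper.
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(8, 128, 512)  # 8 sequences, 128 tokens, d_model = 512
y = fourier_mix(x)            # same shape; the mixing has no learned parameters
assert y.shape == x.shape
```

Because the DFT is parameter-free, this sublayer is cheap in both compute and memory; the rest of the Transformer block (feed-forward sublayers, residual connections, layer normalization) is kept unchanged in the paper.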