FNet: Mixing Tokens with Fourier Transforms
This paper replaces the self-attention layer with two Fourier transforms that mix tokens along the hidden and sequence dimensions. This is about 7X more efficient than self-attention. The trade-off is a small drop in quality: FNet reaches about 92% of BERT's accuracy on the GLUE benchmark.
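
As a minimal sketch of the mixing step (assuming PyTorch; `fourier_mix` is a name chosen here, not from the paper): apply a DFT along the hidden dimension, then along the sequence dimension, and keep only the real part.

```python
import torch

def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, d_model)
    # DFT along the hidden dimension, then along the sequence
    # dimension; keeping only the real part gives the token mixing
    # y = Real(FFT_seq(FFT_hidden(x))) described in the paper.
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(8, 128, 512)  # 8 sequences, 128 tokens, d_model = 512
y = fourier_mix(x)            # same shape; the mixing has no learned parameters
assert y.shape == x.shape
```

Because the DFT is parameter-free, this sublayer is cheap in both compute and memory; the rest of the Transformer block (feed-forward sublayers, residual connections, layer normalization) is kept unchanged in the paper.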