transformer blocks
30 Dec 2022 - 17 Jan 2026
- [[1706.03762] Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- Recurrent networks (including LSTMs) were state of the art in 2017. This paper proposes flushing them and replacing them with nothing but attention, specifically the Transformer.
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely
In these [convolutional] models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet
- would be interesting to know the topology of those networks (I don't).
- Sequence prediction has obvious problems of non-parallelizability, which all of these architectures aim to address somehow.
Earlier designs implemented the attention mechanism in a serial recurrent neural network (RNN) language translation system, but a more recent design, namely the transformer, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.
- self-attention (intra-attention)
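A minimal NumPy sketch of scaled dot-product self-attention, the core operation the paper builds on (its Eq. 1: softmax(QK^T / sqrt(d_k)) V); the function name and shapes here are illustrative, and the real architecture adds learned projections, multiple heads, and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq, seq) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of value vectors

# Self-attention (intra-attention): Q, K, V all come from the same sequence X,
# so every position is related to every other in a single parallel step.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                         # 4 tokens, d_model = 8
out = scaled_dot_product_attention(X, X, X)
print(out.shape)                                        # (4, 8)
```

Note the contrast with the quote above: here the path between any two positions is a single matrix multiply, rather than a number of operations growing with their distance.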