transformer blocks

30 Dec 2022 03:50 - 28 Nov 2023 05:18
    • [[1706.03762] Attention Is All You Need](https://arxiv.org/abs/1706.03762)
      • Recurrent networks (including LSTM) were state of the art in 2017. This paper proposes flushing them and replacing them with nothing but attention mechanisms, specifically the Transformer.
      • In these [convolutional] models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet
      • would be interesting to know the topology of those networks (I don't).
      • Sequence prediction has obvious problems of non-parallelizability, which all these approaches aim to address somehow.
      • self-attention (intra-attention)
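      • The core of self-attention is scaled dot-product attention, where queries, keys, and values are all derived from the same sequence. A minimal NumPy sketch (my own illustration, not code from the paper; the names `scaled_dot_product_attention`, `Q`, `K`, `V` follow the paper's notation, and the example omits the learned projection matrices for brevity):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays. Scores are scaled by
    # sqrt(d_k) to keep dot products from growing with dimension.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output position is a weighted mix of all value vectors
    return weights @ V

# Self-attention: Q, K, V all come from the same input X.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))  # 4 tokens, model dim 8
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```

        Because every position attends to every other in one matrix multiply, the path between any two positions is constant-length, in contrast to the linear/logarithmic growth in the convolutional models above.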