transformer blocks
30 Dec 2022 - 17 Jan 2026
- [[1706.03762] Attention Is All You Need](https://arxiv.org/abs/1706.03762)
- Recurrent networks (including LSTMs) were state of the art in 2017. This paper proposes flushing them and replacing them with nothing but attention, specifically the Transformer.
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely
In these [convolutional] models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet
- would be interesting to know the topology of those networks (I don't).
- Sequence prediction has obvious problems of non-parallelizability, which all of these architectures aim to address somehow.
Earlier designs implemented the attention mechanism in a serial recurrent neural network (RNN) language translation system, but a more recent design, namely the transformer, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme.
- self-attention (intra-attention)
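A minimal NumPy sketch of scaled dot-product self-attention, the core operation the paper builds on (its Eq. 1: softmax(QK^T / sqrt(d_k)) V); the function name and shapes here are illustrative, and the real architecture adds learned projections, multiple heads, and masking.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (seq, seq) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of value vectors

# Self-attention (intra-attention): Q, K, V all come from the same sequence X,
# so every position is related to every other in a single parallel step.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                         # 4 tokens, d_model = 8
out = scaled_dot_product_attention(X, X, X)
print(out.shape)                                        # (4, 8)
```

Note the contrast with the quote above: here the path between any two positions is a single matrix multiply, rather than a number of operations growing with their distance.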