Understanding Transformer Attention

By Frederick Lowe, Jan 8, 2025

Most of what folks call "AI" now revolves around Generative Pre-trained Transformers. A key concept in understanding how GPTs work is to understand the concept of "Attention".

The computational cost of leveraging a Transformer's full representational capability is O(n²) with respect to the length of the input sequence. Quadratic. If you're not an engineer, a simpler way to understand this is that for each token (or word) in your prompt, the model calculates attention scores for every other token. These scores determine how much each token should "attend to" to every other token.

The "magic" of LLM response coherence emerges from those statistical relationships.

If math isn't your thing, think of it this way: imagine being asked to remember every letter in a sequence in relation to every other letter. If I give you two letters, "a" and "b," you only need to track two relationships: "ab" and "ba." But if I give you three letters , "a," "b," and "c" , you now have six relationships to consider: "ab," "ac," "ba," "bc," "ca," and "cb." For every new letter I add, the number of attention computations grows rapidly. Grossly, a moderately complex 1800-token prompt requires 3.24 million attention computations. In practice, use of triangular matrices for causal masking reduces actual computations to roughly half that number.

Modern inference strategies (LongFormer’s Sliding Window Attention, O(n × w), BigBird's Block Sparse Attention, O(n)) reduce computational costs by constraining the model's representational capacity.

Positional encoding techniques like RoPE (Rotary Position Embedding) and ALiBi (Attention with Linear Biases) help maintain long-range modeling capability within these constrained attention patterns. The fundamental reality remains: sparse attention involves some loss of representational capacity, though for most practical applications this doesn't meaningfully impact output quality.

You can't squeeze O(n²) juice from O(n) lemons!

-Rudolf Clausius hot mic gaffe on panel with Dario Amodei, maybe

Takeaway

Longer prompts require more computation for the Transformer to process. Emerging inference strategies reduce the computational complexity at the expense of representational capacity. Positional encoding techniques make the loss more theoretical than practical for most use cases.