Understanding Transformer Attention
By Frederick Lowe, Jan 8, 2025

Most of what folks call "AI" now runs on Transformer-based architectures. GPTs (Generative Pre-trained Transformers) are the most familiar variant; the family also includes encoder-only models like BERT and encoder-decoder models like T5. A key concept in understanding how any of them work is "Attention" — and a key concept in understanding why they're expensive to run is the cost of attention plus a few other architectural features that travel with it.
This essay covers both. Attention first.
What Attention Costs
The computational cost of full attention is O(n²) with respect to the length of the input sequence. Quadratic. For each token in the input, the model calculates attention scores against every other token. Those scores determine how much each token should "attend to" every other token, and the statistical relationships they encode are where LLM response coherence emerges.
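For readers who prefer code to notation, here is a minimal sketch of full (unmasked) scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not how production models implement it: there is a single head, no learned projections, and the function names (`softmax`, `attention`) are made up for this example. The point is the nested loop: every query is scored against every key, so the score count is n × n.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Toy single-head attention over lists of equal-length vectors.

    Returns the output vectors and the number of scores computed.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    score_count = 0
    for q in queries:
        # One score per (query, key) pair: this loop is the O(n^2) part.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        score_count += len(scores)
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs, score_count

# Three toy "token" vectors, used as queries, keys, and values alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, score_count = attention(tokens, tokens, tokens)
print(score_count)  # 9 scores for 3 tokens
```

Doubling the number of tokens quadruples `score_count`, which is the quadratic growth described above.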
If math isn't your thing, think of it this way: imagine being asked to track every letter in a sequence in relation to every other letter. Two letters, "a" and "b," give you two relationships: "ab" and "ba." Three letters ("a," "b," "c") give you six: "ab," "ac," "ba," "bc," "ca," "cb." In general, n letters give n × (n − 1) ordered pairs, so the count grows roughly as the square of the sequence length. A moderately complex 1800-token prompt requires roughly 3.24 million attention score computations (1800 × 1800, counting each token's score against itself) per attention head, per layer. Causal masking (triangular matrices, used during training and prefill in autoregressive models) reduces this to about half in practice.
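The arithmetic in the letter analogy can be checked directly. This is a small sketch using only the standard library; the helper names are invented for illustration, and the causal count assumes each token attends to itself and to every earlier token.

```python
from itertools import permutations

def ordered_pairs(items):
    # Every ordered pair of distinct items, e.g. "ab" and "ba".
    return ["".join(p) for p in permutations(items, 2)]

def full_attention_scores(n):
    # Every token scored against every token, itself included.
    return n * n

def causal_attention_scores(n):
    # Each token scored against itself and all earlier tokens:
    # 1 + 2 + ... + n, the lower triangle of the n x n matrix.
    return n * (n + 1) // 2

print(ordered_pairs("ab"))            # ['ab', 'ba']
print(ordered_pairs("abc"))           # ['ab', 'ac', 'ba', 'bc', 'ca', 'cb']
print(full_attention_scores(1800))    # 3240000
print(causal_attention_scores(1800))  # 1620900
```

The causal figure is slightly more than half of the full figure because the diagonal (each token attending to itself) is kept.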
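Sparse attention strategies reduce attention compute by limiting which tokens can attend to which. Longformer's sliding window attention costs O(n × w) for a window of size w, and BigBird's block sparse attention costs O(n). Both also keep a small number of global tokens that attend across the whole sequence, which preserves a path for long-range information. Positional encoding techniques like RoPE and ALiBi help maintain long-range modeling capability within those constrained patterns.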
Sparse attention does involve some loss of representational capacity, but for most practical applications the loss isn't meaningfully visible in output quality.
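To make the O(n × w) figure concrete, here is a rough sketch that counts query-key pairs under a sliding window. It assumes each token attends to itself and to w // 2 neighbors on either side, and it leaves out global tokens and everything else a real implementation such as Longformer's adds; `sliding_window_pairs` is a hypothetical helper, not a library function.

```python
def sliding_window_pairs(n, w):
    """Count (query, key) pairs when each token sees w // 2 neighbors per side."""
    half = w // 2
    total = 0
    for i in range(n):
        lo = max(0, i - half)       # window is clipped at the sequence start
        hi = min(n - 1, i + half)   # and at the sequence end
        total += hi - lo + 1
    return total

n = 1800
windowed = sliding_window_pairs(n, 256)
full = n * n
print(windowed, full, round(full / windowed, 1))
```

For a fixed window, doubling n roughly doubles the windowed count while the full count quadruples, which is where the savings come from on long inputs.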
You can't squeeze O(n²) juice from O(n) lemons.
- Rudolf Clausius hot mic gaffe on panel with Dario Amodei, maybe
The joke is fun. It's also slightly unfair to sparse attention, which delivers most of the juice from most of the lemons in most cases.
What Else Costs at Inference Time
Attention is the headline cost, but it isn't the whole story for inference latency. A few other architectural features matter at least as much.
Prefill vs. decode. Processing a prompt and generating a response are two different phases. Prefill runs the prompt through the model in parallel, and this is where O(n²) attention bites hardest. Decode generates the response one token at a time, with each new token requiring a full forward pass through the model. With KV caching (storing prior keys and values so they are not recomputed), the attention cost of decode is O(n) per token rather than O(n²), but generation is still token-by-token and inherently serial.
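A toy KV cache makes the decode-side saving easier to see. This is a sketch under heavy simplification: one head, no learned projections, and the same toy vector standing in for a token's query, key, and value. The `KVCache` class and `attend` function are invented for this example. Each decode step appends one key and one value, then scores a single query against everything cached so far.

```python
import math

def attend(query, keys, values):
    # One query against all cached keys: linear in the cache length.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    width = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(width)]

class KVCache:
    """Hypothetical single-head cache: stores keys and values from earlier steps."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.scores_computed = 0

    def step(self, query, key, value):
        # Store the new token's key and value instead of recomputing old ones.
        self.keys.append(key)
        self.values.append(value)
        self.scores_computed += len(self.keys)
        return attend(query, self.keys, self.values)

cache = KVCache()
for vec in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]:
    out = cache.step(vec, vec, vec)
print(cache.scores_computed)  # 10, i.e. 1 + 2 + 3 + 4
```

Step t costs t scores, so per-token attention work grows linearly with context length. The steps themselves still run one after another.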
Model depth. A Transformer is a stack of layers, and you can't skip layers. Every forward pass walks through every layer in sequence, because each layer needs the output of the one before it. Depth therefore sets a wall-clock floor: parallelism within a layer can shorten that layer, but it cannot overlap one layer with the next.
Autoregressive generation. Each output token depends on the previous one. You can't generate token 50 until you've generated token 49. Speculative decoding partially mitigates this (predict ahead, verify, discard wrong predictions), but the underlying serial dependency is structural.
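The predict, verify, discard loop can be sketched with two toy "models." Everything here is hypothetical: `target_next` stands in for the large model, `draft_next` for a cheap model that is usually right, and both are deterministic functions over integers rather than neural networks. The sketch uses greedy exact-match verification, a simplification of the probabilistic acceptance rule used in practice, and it verifies in a Python loop where a real system would score all drafted positions in one parallel forward pass of the large model.

```python
def target_next(tokens):
    # Hypothetical large model: counts upward modulo 10.
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    # Hypothetical small model: agrees with the target except after a 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

def speculative_step(tokens, k):
    # 1. Predict ahead: the draft model proposes k tokens.
    context = list(tokens)
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context))
        context.append(proposal[-1])
    # 2. Verify: compare each proposed token with the target's choice.
    context = list(tokens)
    accepted = []
    for tok in proposal:
        expected = target_next(context)
        if tok != expected:
            # 3. Discard the wrong token and everything after it,
            #    keeping the target's token for this position.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    return accepted

def generate(prompt, n_new, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        tokens.extend(speculative_step(tokens, k))
        target_passes += 1  # one verification pass per step
    return tokens[: len(prompt) + n_new], target_passes

tokens, passes = generate([0], n_new=8)
print(tokens, passes)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] 3
```

The output matches what the target alone would produce, but in three verification passes instead of eight. Each pass still has to finish before the next can start, which is the serial dependency the paragraph above describes.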
Memory bandwidth. On most inference hardware, the actual bottleneck during decode isn't compute; it's moving model weights and the KV cache between memory and compute units. This is why batch size matters for throughput (one read of the weights can serve every sequence in the batch), and why single-query latency is often hardware-limited regardless of how clever the attention strategy is.
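A back-of-envelope calculation shows why bandwidth sets a floor. The numbers below are illustrative assumptions, not measurements of any particular model or accelerator: a hypothetical 7-billion-parameter model stored at 2 bytes per parameter, on hardware with 1 TB/s of memory bandwidth. The sketch also assumes every weight is read once per decode step and ignores KV cache traffic, so it is a lower bound, not an estimate.

```python
def decode_floor_ms(param_count, bytes_per_param, bandwidth_bytes_per_s):
    """Minimum time to stream all weights from memory once, in milliseconds."""
    weight_bytes = param_count * bytes_per_param
    return 1000.0 * weight_bytes / bandwidth_bytes_per_s

def tokens_per_second(floor_ms, batch_size):
    # One weight read serves the whole batch, so throughput scales with
    # batch size while per-step latency stays at the floor.
    return batch_size * 1000.0 / floor_ms

# Hypothetical figures, chosen only to make the arithmetic easy.
floor = decode_floor_ms(7e9, 2, 1e12)
print(floor)                         # about 14 ms per decode step
print(tokens_per_second(floor, 1))   # roughly 71 tokens/s for one query
print(tokens_per_second(floor, 16))  # roughly 1143 tokens/s across the batch
```

Under these assumptions a single query cannot decode faster than about one token every 14 ms, however much compute is available, while batching raises total throughput without lowering that per-step latency.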
Why This Matters
Pull the threads together: long prompts cost O(n²) for prefill. Every output token requires a serial forward pass through every layer of the model. Autoregressive generation can't be fully parallelized. Memory bandwidth limits throughput per query.
Sparse attention, KV caching, and speculative decoding each chip away at one piece of this. None of them eliminate the architectural floor.
For long-form generation, the floor is fine: users tolerate a few seconds for a thoughtful response. For hard-realtime applications (sub-100ms decisions in ad bidding, robotics, autonomous vehicles, grid switching), it's a hard constraint. Scaling LLMs improves capability per inference. It does not reduce the inference latency floor.
Takeaway
Attention is O(n²) in sequence length and dominates prefill cost. Sparse attention strategies trade a small representational cost for substantial compute savings. Inference latency, though, is set by more than attention: model depth, autoregressive serial generation, and memory bandwidth all contribute to a floor that scaling alone cannot reduce.
AI Use Disclosure
Content Authorship: ~70% human, ~30% AI-drafted.
AI Use, Claude Opus 4.7 (Editor and Co-author Role): Identified gaps in the original 2025 essay's treatment of inference latency and drafted the "What Else Costs at Inference Time" and "Why This Matters" sections. Also broadened "GPTs" to "Transformer-based architectures" in the opener and added one corrective line on the sparse-attention joke. Original sections on attention compute, the letter analogy, sparse attention, and positional encoding are unchanged from the January 2025 draft.