
Glossary Term

KV Cache

The memory a transformer uses during generation to avoid recomputing previous tokens.

The KV (key-value) cache is the internal memory a transformer model maintains during text generation. When generating token by token, the model computes attention keys and values for every token it has processed so far. Without caching, it would recompute those keys and values for the entire prefix at every step, so the cost of producing each new token grows quadratically with the sequence length. With the KV cache, the stored keys and values are reused and each step only computes them for the newest token, keeping the per-token work linear in the context length.
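
To make the mechanism concrete, here is a minimal sketch in Python of single-head attention during step-by-step decoding, with keys and values cached across steps. The toy dimensions, random projection matrices, and function names are illustrative assumptions, not any real model or library API.

```python
# Minimal KV-cache sketch: toy single-head attention during decoding.
# Everything here (dimensions, random weights) is illustrative only.
import numpy as np

d_model = 16
rng = np.random.default_rng(0)

# Random projections standing in for a trained attention layer.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attend(x_new, k_cache, v_cache):
    """Process ONE new token embedding, reusing cached keys and values."""
    q = x_new @ W_q                    # query for the new token only
    k_cache.append(x_new @ W_k)        # grow the cache instead of
    v_cache.append(x_new @ W_v)        # recomputing K/V for old tokens
    K = np.stack(k_cache)              # (seq_len, d_model)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d_model)  # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # attention output for the new token

k_cache, v_cache = [], []
for step in range(5):
    x_new = rng.standard_normal(d_model)  # stand-in for the next token's embedding
    out = attend(x_new, k_cache, v_cache)
    print(f"step {step}: cache length = {len(k_cache)}")
```

Each step does work proportional to the current cache length, rather than rerunning the full forward pass over the whole prefix.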

From a user's perspective, you rarely see the KV cache directly; the inference server manages it. But it has practical implications: (1) Prompt caching, the feature API providers sell, builds on the same idea at the API level: identical prompt prefixes are served from cache at lower cost. (2) Long-context efficiency: the KV cache grows linearly with context length and is the main reason long contexts need more GPU memory and cost more. (3) Self-hosted deployments must size GPU VRAM to hold both the model weights and the KV cache for the expected batch size and context length (a rough sizing sketch follows below).
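
For point (3), a back-of-the-envelope formula helps: cache size is 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch size × bytes per value. The Python sketch below applies it; the model dimensions used in the example are hypothetical, not the published configuration of any particular model.

```python
# Rough KV-cache size estimate. The example dimensions below are
# hypothetical, roughly in the range of a 7B-class dense model.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    # 2 accounts for storing both keys and values; bytes_per_value=2 assumes fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                      seq_len=8192, batch_size=8)
print(f"{size / 1024**3:.1f} GiB")  # 32.0 GiB of VRAM on top of the model weights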

Bring this to your business

Knowing the term is one thing. Shipping it is another.

We do two-week AI Sprints — one term, one workflow, into production by Day 10.