Throughput measures volume: requests per second, tokens per second, or both. At small scale you don't think about it. At scale, it determines how many GPUs you need (self-hosted) or whether you'll hit rate limits (hosted).
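To make the two metrics concrete, here is a minimal sketch of measuring both requests per second and tokens per second around any model call. The `process` callable and the fixed 50-token response are stand-ins, not a real API:

```python
import time

def measure_throughput(process, requests):
    """Time a run of requests and report requests/sec and tokens/sec.

    `process` is any callable (e.g. a wrapped model call) that returns
    the number of tokens it generated for one request.
    """
    start = time.perf_counter()
    total_tokens = sum(process(r) for r in requests)
    elapsed = time.perf_counter() - start
    return {
        "requests_per_sec": len(requests) / elapsed,
        "tokens_per_sec": total_tokens / elapsed,
    }

# Stand-in for a real model call: pretend each request yields 50 tokens.
stats = measure_throughput(lambda r: 50, range(100))
```

The two numbers diverge in practice: long completions push tokens/sec up while requests/sec stays flat, which is why capacity planning usually needs both.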
The biggest lever is batching: processing many requests together so each GPU pass does more work. Hosted APIs do this for you behind the scenes; self-hosted setups need careful tuning of a serving framework such as vLLM, TensorRT-LLM, or SGLang. Quantization (running weights at 8-bit or 4-bit precision) often doubles throughput with negligible quality loss.
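Why batching helps can be shown with a toy cost model: each forward pass has a large fixed cost (loading weights from memory) plus a small marginal cost per request in the batch, so bigger batches amortize the fixed cost. The millisecond figures below are illustrative assumptions, not benchmarks:

```python
# Toy cost model of batched serving. Numbers are illustrative only.
FIXED_MS = 20.0    # assumed per-pass overhead (memory-bound weight reads)
PER_REQ_MS = 1.0   # assumed marginal compute per request in the batch

def time_to_serve(n_requests: int, batch_size: int) -> float:
    """Total milliseconds to serve n_requests at a given batch size."""
    passes = -(-n_requests // batch_size)  # ceiling division
    return passes * (FIXED_MS + batch_size * PER_REQ_MS)

for b in (1, 8, 32):
    ms = time_to_serve(256, b)
    print(f"batch={b:2d}: {ms:7.1f} ms -> {256 / (ms / 1000):.0f} req/s")
```

Under these assumptions, going from batch size 1 to 32 cuts total time by more than 10x, which is the intuition behind the continuous-batching schedulers in vLLM and similar frameworks.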
Bring this to your business
Knowing the term is one thing. Shipping it is another.
We do two-week AI Sprints — one term, one workflow, into production by Day 10.