LLM TOPS Calculator

Inputs

Model

Quantization

Target tokens / second

Context length (tokens)

Real-world efficiency

Your hardware

Results

Required raw TOPS

—

Theoretical floor: 2 × params × tokens/sec, in operations per second (divide by 10^12 to get TOPS).

Required effective TOPS

—

Raw ÷ efficiency. What you actually need on the spec sheet.
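As a rough illustration of the two compute figures above, here is a minimal Python sketch. The function name `required_tops` and the example inputs (an 8-billion-parameter model, 20 tokens/sec, 30% efficiency) are assumptions made for the example, not values taken from the calculator.

```python
def required_tops(params: float, tokens_per_sec: float, efficiency: float) -> tuple[float, float]:
    """Return (raw_tops, effective_tops) for single-stream decoding.

    raw       = 2 * params * tokens_per_sec, converted to tera-operations/sec
    effective = raw / efficiency, where efficiency is a fraction in (0, 1]
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be a fraction in (0, 1]")
    raw = 2 * params * tokens_per_sec / 1e12  # ops/sec -> TOPS
    return raw, raw / efficiency


# Illustrative inputs only: 8e9 parameters, 20 tokens/sec, 30% efficiency.
raw, effective = required_tops(8e9, 20, 0.30)
print(f"raw: {raw:.2f} TOPS, effective: {effective:.2f} TOPS")
```

With these inputs the raw requirement is about 0.32 TOPS and the effective requirement is a little over 1 TOPS, which hints at why compute is rarely the limiting factor for one-user inference.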

Weights memory

—

RAM needed just to hold the model.
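A common way to estimate this figure is parameters × bits per weight ÷ 8. The sketch below assumes that simple formula; the calculator's exact method is not shown on the page, and real quantization formats usually add some per-block overhead, so treat the result as a lower bound. The name `weights_memory_gb` is hypothetical.

```python
def weights_memory_gb(params: float, bits_per_weight: float) -> float:
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Assumes every parameter is stored at the same bit width and ignores
    quantization metadata such as per-block scales.
    """
    total_bytes = params * bits_per_weight / 8
    return total_bytes / 1e9


# Illustrative inputs only: an 8e9-parameter model at 4-bit and 16-bit precision.
print(weights_memory_gb(8e9, 4))   # 4-bit quantization
print(weights_memory_gb(8e9, 16))  # 16-bit weights
```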

KV cache memory

—

Grows with context length.
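To show why the cache grows with context, here is a sketch using the standard estimate for a transformer decoder: one key and one value vector per layer, per KV head, per token. The architecture numbers in the example (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are illustrative assumptions, not the configuration of any model listed in the calculator.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: int = 2) -> int:
    """Estimate KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


# Hypothetical architecture, used only to show the linear growth with context.
for context in (2048, 8192, 32768):
    size_gib = kv_cache_bytes(32, 8, 128, context) / 2**30
    print(f"{context:>6} tokens -> {size_gib:.2f} GiB")
```

Because every term except `context_len` is fixed by the model, doubling the context doubles the cache.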

Total memory

—

Weights + KV cache + ~1 GB overhead.

Bandwidth ceiling

—

Max tokens/sec your RAM bandwidth allows (batch=1).

Pick a model and a target to see the verdict.

What fits your hardware

Tokens/sec ceiling (bandwidth-bound, batch=1) for a curated set of models on your selected hardware at the chosen quantization. Bars are coloured against your tokens/sec target.

Why TOPS isn't the whole story on a Mac

Apple advertises Neural Engine TOPS (38 on M4, for example), but most local LLM runtimes — llama.cpp, Ollama, MLX, LM Studio — run on the GPU, not the Neural Engine. For batch=1 inference (one user, one prompt), the bottleneck is almost always memory bandwidth, not compute.

Rule of thumb: you can't generate tokens faster than your RAM can stream the model's weights through the chip, because every weight is read once per generated token. A 4 GB quantized model on a chip with 120 GB/s bandwidth caps out around 30 tokens/sec (120 ÷ 4), no matter how many TOPS the spec sheet lists.
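The rule of thumb reduces to one division. The sketch below reproduces the 4 GB / 120 GB/s example from the paragraph above; the function name `bandwidth_ceiling_tps` is hypothetical, and the estimate assumes batch=1 with the full set of weights read once per token, ignoring KV cache reads and other traffic.

```python
def bandwidth_ceiling_tps(bandwidth_gb_per_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/sec when decoding is memory-bandwidth-bound.

    Assumes batch=1 and that all weights are streamed once per token.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_per_s / model_size_gb


# The example from the text: a 4 GB model on a 120 GB/s chip.
print(bandwidth_ceiling_tps(120, 4))  # 30.0
```

Real throughput usually lands below this ceiling, so it is best read as a limit rather than a prediction.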

That's why this calculator shows both numbers. If you're trying to decide whether 38–50 TOPS is enough for professional use, the honest answer is: yes, comfortably — but check the bandwidth ceiling for the model you actually want to run.