Inputs
Paste a model URL or repo ID (e.g. Qwen/Qwen2.5-7B-Instruct) and click Import. The tool reads config.json and the safetensors metadata directly from Hugging Face. Gated repos (meta-llama, google/gemma, mistralai…) need an auth token and won't work here.
Have access to a gated repo? Paste config.json here
Open the model's config.json on Hugging Face (after accepting the license), copy the whole file, and paste it below. No token, no fetch — everything stays in your browser.
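For a sense of what a pasted config.json can tell you, here is a minimal sketch that estimates a parameter count from it. It assumes a Llama-style dense decoder with a gated MLP and the usual Hugging Face key names (hidden_size, num_hidden_layers, intermediate_size, vocab_size, num_attention_heads, num_key_value_heads); it ignores biases and norm weights, and it is not necessarily how this calculator does it. The function name estimate_params is hypothetical.

```python
import json


def estimate_params(config_text: str) -> int:
    """Rough parameter count for a Llama-style dense decoder (assumption)."""
    cfg = json.loads(config_text)

    hidden = cfg["hidden_size"]
    layers = cfg["num_hidden_layers"]
    inter = cfg["intermediate_size"]
    vocab = cfg["vocab_size"]
    heads = cfg["num_attention_heads"]
    # Grouped-query attention: fall back to full multi-head if the key is absent.
    kv_heads = cfg.get("num_key_value_heads", heads)
    head_dim = cfg.get("head_dim", hidden // heads)
    tied = cfg.get("tie_word_embeddings", False)

    # Attention: Q and O projections use all heads, K and V use kv_heads.
    attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden * inter
    # Input embeddings, plus a separate output head when weights are not tied.
    embed = vocab * hidden * (1 if tied else 2)

    return layers * (attn + mlp) + embed
```

Multiplying the result by bits per weight and dividing by 8 gives an approximate weight size in bytes at a given quantization.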
Results
What fits your hardware
Tokens/sec ceiling (bandwidth-bound, batch=1) for a curated set of models on your selected hardware at the chosen quantization. Bars are coloured against your tokens/sec target.
Why TOPS isn't the whole story on a Mac
Apple advertises Neural Engine TOPS (38 on M4, for example), but most local LLM runtimes — llama.cpp, Ollama, MLX, LM Studio — run on the GPU, not the Neural Engine. For batch=1 inference (one user, one prompt), the bottleneck is almost always memory bandwidth, not compute.
Rule of thumb: you can't generate tokens faster than your memory can stream the model's weights through the chip, because each new token requires reading essentially all of the weights once. A 4 GB quantized model on a chip with 120 GB/s bandwidth caps out around 120 / 4 = 30 tokens/sec, no matter how many TOPS the spec sheet says.
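The rule of thumb can be written out as a short sketch. It assumes a dense model whose full weights are read once per generated token at batch=1, and it ignores KV-cache traffic and runtime overhead, so real throughput will land below this ceiling. The function names are illustrative, not part of this calculator.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


def tokens_per_sec_ceiling(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on batch=1 decode speed when memory bandwidth is the limit."""
    if model_size_gb <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("model size and bandwidth must be positive")
    # Each token streams the whole model once, so the ceiling is a simple ratio.
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    # The example from the text: 4 GB of weights on a 120 GB/s chip.
    print(tokens_per_sec_ceiling(4.0, 120.0))  # 30.0
```

For instance, quantized_size_gb(7, 4) gives 3.5 GB for a 7B-parameter model at 4 bits per weight, which can then be passed to tokens_per_sec_ceiling along with your chip's bandwidth.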
That's why this calculator shows both numbers. If you're trying to decide whether 38–50 TOPS is enough for professional use, the honest answer is: yes, comfortably — but check the bandwidth ceiling for the model you actually want to run.