How much thinking can you buy — and what quality — for a fixed budget?
AI tasks span five difficulty tiers, from "set a timer" to "prove a theorem." A $1,000 Mac Mini (amortized to $42/month over 24 months) gives you unlimited easy thinking but can't touch hard problems. API models can solve harder problems but charge per token. When do these lines cross: at what point does a $1,000 box handle 95% of a knowledge worker's daily needs?
The Landscape Today
Reliability rate per difficulty tier — regardless of cost. How often does the model get it right?
What each successfully solved task costs via API. Local models = ~$0 marginal cost (hardware is sunk).
Highest difficulty tier reliably cleared, projected through 2030. The dashed line marks Tier 3 — "professional grade" — where local hardware handles ~95% of daily cognitive work.
Estimated distribution of daily AI tasks by difficulty tier for a power user
Different models have fundamentally different capability profiles. Local models are fast and efficient but narrow; frontier models are capable but expensive.
| Model | How you run it | Params | MMLU | Tier Ceiling | Speed | Marginal $/task | Monthly cost |
|---|---|---|---|---|---|---|---|
| Qwen 3 3.5B Q4 | Local — $1K Mac Mini | 3.5B | ~62% | Tier 1 | ~90 tok/s | ~$0 | $42 amortized |
| Qwen 3 8B Q4 | Local — $1K Mac Mini | 8B | ~72% | Tier 2 | ~50 tok/s | ~$0 | $42 amortized |
| Llama 4 Scout 17B Q4 | Local — $1K Mac Mini | 17B (109B total MoE) | ~78% | Tier 2+ | ~30 tok/s | ~$0 | $42 amortized |
| GPT-5.4 nano | API — pay per token | undisclosed | ~76% | Tier 2 | ~200 tok/s | ~$0.001 | ~$5–15 |
| GPT-5.4 mini | API — pay per token | undisclosed | ~84% | Tier 3 | ~185 tok/s | ~$0.01 | ~$20–50 |
| Claude Sonnet 4.6 | API — pay per token | undisclosed | ~87% | Tier 3+ | ~46 tok/s | ~$0.03 | ~$50–120 |
| GPT-5.4 | API — pay per token | undisclosed | ~92% | Tier 4 | ~72 tok/s | ~$0.05 | ~$80–200 |
| Claude Opus 4.6 | API — pay per token | undisclosed | ~90% | Tier 4 | ~48 tok/s | ~$0.08 | ~$100–300 |
Tier definitions: T1 = Reflexive (reminders, lookups) · T2 = Competent (emails, summaries, planning) · T3 = Professional (taxes, contracts, debugging) · T4 = Expert (novel strategy, complex research) · T5 = Frontier (unsolved problems)
Amortized hardware: $1,000 / 24 months ≈ $42/mo, plus ~$5/mo electricity (≈ $47/mo total). Marginal cost per task = ~$0.
API monthly: Estimated for a moderate power user (~2 hours/day, mixed task types).
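To make the cost comparison concrete, here is a minimal Python sketch of the amortization arithmetic and the break-even point it implies. The hardware price, amortization period, electricity estimate, and marginal $/task figures are taken from the notes and table above. The function names and the idea of a "break-even task count" are illustrative assumptions, not part of the original analysis.

```python
# Figures from the notes above.
HARDWARE_PRICE = 1000.0        # USD, one-time purchase
AMORTIZATION_MONTHS = 24       # amortization period used in the notes
ELECTRICITY_PER_MONTH = 5.0    # USD, approximate

# Marginal $/task values copied from the table (approximate).
API_COST_PER_TASK = {
    "GPT-5.4 nano": 0.001,
    "GPT-5.4 mini": 0.01,
    "Claude Sonnet 4.6": 0.03,
    "GPT-5.4": 0.05,
    "Claude Opus 4.6": 0.08,
}


def local_monthly_cost(price=HARDWARE_PRICE,
                       months=AMORTIZATION_MONTHS,
                       electricity=ELECTRICITY_PER_MONTH):
    """Amortized hardware cost plus electricity, in USD per month."""
    return price / months + electricity


def break_even_tasks(api_cost_per_task, local_cost=None):
    """Tasks per month at which API spend equals the local monthly cost.

    Hypothetical helper: assumes every task costs the same flat amount
    and that the local marginal cost is exactly zero.
    """
    if api_cost_per_task <= 0:
        raise ValueError("api_cost_per_task must be positive")
    if local_cost is None:
        local_cost = local_monthly_cost()
    return local_cost / api_cost_per_task


for model, cost in API_COST_PER_TASK.items():
    print(f"{model}: break-even at about {break_even_tasks(cost):,.0f} tasks/month")
```

The comparison only holds for tasks that fall at or below the local model's tier ceiling. Above that ceiling the local box cannot substitute for the API at any volume.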
Tier ceiling: Highest tier where model achieves >85% reliability on representative tasks.
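The tier-ceiling rule can be sketched in a few lines of Python. The reliability figures below are hypothetical placeholders, not measurements from this analysis, and the function name is an assumption. The sketch reads the rule literally: it returns the highest tier whose reliability is strictly above 85%, without checking that lower tiers also clear the bar.

```python
RELIABILITY_THRESHOLD = 0.85  # from the tier-ceiling definition above


def tier_ceiling(reliability_by_tier, threshold=RELIABILITY_THRESHOLD):
    """Return the highest tier whose reliability exceeds the threshold.

    reliability_by_tier maps tier number (1-5) to a success rate in [0, 1].
    Returns 0 if no tier clears the threshold.
    """
    cleared = [tier for tier, rate in reliability_by_tier.items()
               if rate > threshold]
    return max(cleared, default=0)


# Hypothetical reliability profile for illustration only.
example_profile = {1: 0.99, 2: 0.93, 3: 0.88, 4: 0.61, 5: 0.12}
print(tier_ceiling(example_profile))
```

A stricter variant would require every lower tier to clear the threshold as well. For a profile where reliability falls as difficulty rises, both readings give the same answer.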
Speed sources: API model speeds are from Artificial Analysis (Mar 2026). Local model speeds are estimates for an M4 Mac Mini (24GB) running MLX with Q4 quantization.
MMLU (Massive Multitask Language Understanding):
57-task benchmark spanning STEM, humanities, social sciences, and professional domains.
Measures broad knowledge and reasoning — roughly "how much of a college-educated generalist's knowledge does this model have?"
Not a ceiling on capability, but a useful proxy for general competence.