There’s a video making the rounds of someone running a near-Sonnet-quality model on a MacBook Pro, offline, on a plane. It’s genuinely impressive. Apple’s M5 Max put 128GB of unified memory and per-core Neural Accelerators into a laptop, and a 30B-class model now answers in real time with no API key, no network, no bill.
That demo raises an obvious question: if a laptop can do this, why pay anyone for inference?
Work the numbers and the excitement gets more complicated. Swapping one local electricity rate for global prices turns the question into a decision: when does buying Apple Silicon for local inference actually pay off, and when should you route through a service like haimaker.ai instead?
The setup
The reference machine is an M5 Max MacBook Pro. Call it ~$4,299 for a 64GB configuration, more if you want the full 128GB. It runs a Gemma 4 31B-class model that lands somewhere near Claude Sonnet on everyday tasks. Under inference load it pulls 50–100 watts and produces roughly 15–40 tokens per second depending on the model, quantization, and context length.
Those are the only hardware numbers we need. Everything else is arithmetic.
Electricity is a rounding error
Most “is local cheaper” arguments start with the power bill, so let’s kill that variable first.
Energy per token is just power divided by throughput. At a working midpoint of ~65W and ~25 tokens/sec, that’s about 2.6 joules per token, or roughly 0.7 kWh per million tokens. Across the full 50–100W and 15–40 t/s range, you land somewhere between 0.35 and 1.85 kWh per million tokens.
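The conversion above is easy to check in a few lines. This is a minimal sketch using only the wattage and throughput figures quoted in this section; the function name `kwh_per_million_tokens` is illustrative, not from any library.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def kwh_per_million_tokens(watts: float, tokens_per_sec: float) -> float:
    """Energy needed to generate one million tokens, in kWh."""
    # A watt is a joule per second, so W / (tokens/s) gives joules per token.
    joules_per_token = watts / tokens_per_sec
    return joules_per_token * 1_000_000 / JOULES_PER_KWH


midpoint = kwh_per_million_tokens(65, 25)   # working midpoint
best = kwh_per_million_tokens(50, 40)       # lowest power, fastest generation
worst = kwh_per_million_tokens(100, 15)     # highest power, slowest generation

print(f"midpoint: {midpoint:.2f} kWh per 1M tokens")
print(f"range:    {best:.2f} to {worst:.2f} kWh per 1M tokens")
```

The best and worst cases pair opposite ends of the two ranges (low power with high throughput, and the reverse), which is why the spread is wider than either input range alone.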
Now apply real electricity prices. As of Q1 2026, residential rates by region look like this (GlobalPetrolPrices):
| Region | Residential $/kWh | Electricity cost per 1M tokens (~0.7 kWh) |
|---|---|---|
| Asia | $0.085 | ~$0.06 |
| Africa | $0.139 | ~$0.10 |
| North America | $0.148 | ~$0.10 |
| World average | $0.174 | ~$0.12 |
| South America | $0.207 | ~$0.14 |
| Europe | $0.255 | ~$0.18 |
| Oceania | $0.257 | ~$0.18 |
Even in expensive Europe or Oceania, even at the inefficient end of the throughput range (1.85 kWh per million tokens), you’re looking at just under $0.50 per million tokens in electricity. For most of the world, at the midpoint, it’s a dime. The power bill is not why local inference is expensive.
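To reproduce the table and the worst-case bound, multiply each regional rate by the energy per million tokens. The sketch below reuses the rates from the table and the 0.7 and 1.85 kWh figures from the text; the variable and function names are illustrative.

```python
KWH_PER_M_MIDPOINT = 0.7   # ~65 W at ~25 tokens/sec
KWH_PER_M_WORST = 1.85     # 100 W at 15 tokens/sec

# Residential rates in USD per kWh, copied from the table above.
RATES_USD_PER_KWH = {
    "Asia": 0.085,
    "Africa": 0.139,
    "North America": 0.148,
    "World average": 0.174,
    "South America": 0.207,
    "Europe": 0.255,
    "Oceania": 0.257,
}


def electricity_cost_per_million(rate_usd_per_kwh: float, kwh_per_million: float) -> float:
    """Electricity cost in USD to generate one million tokens."""
    return rate_usd_per_kwh * kwh_per_million


costs = {
    region: (
        electricity_cost_per_million(rate, KWH_PER_M_MIDPOINT),
        electricity_cost_per_million(rate, KWH_PER_M_WORST),
    )
    for region, rate in RATES_USD_PER_KWH.items()
}

for region, (mid_cost, worst_cost) in costs.items():
    print(f"{region:<14} midpoint ${mid_cost:.2f}   worst case ${worst_cost:.2f}")
```

Every worst-case figure stays below $0.50, with Europe and Oceania coming closest.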
Where the math breaks
The real cost is the $4,299 sitting on your desk, and how few tokens you actually push through it.
Here’s the trap. People imagine “I use it 8 hours a day.” But you don’t generate tokens 8 hours a day. You read, think, type, and sit in meetings. For a heavy individual user, actual generation is closer to 1–2 hours of wall-clock time per day.
Run the amortization at ~25 tokens/sec:
| Utilization | Tokens/day | Tokens over 3 years | Hardware cost per 1M tokens |
|---|---|---|---|
| Realistic personal (2 hrs/day generating) | ~0.18M | ~197M | ~$22 |
| Heavy solo dev (6 hrs/day generating) | ~0.54M | ~590M | ~$7 |
| Pinned 24/7 (server-style) | ~2.16M | ~2,365M | ~$1.80 |
Add the electricity (a dime or two) and that’s your true cost per million tokens. Stretch the machine to 5 years instead of 3 and the 24/7 number drops near $1.10/M, but only if you keep it saturated every hour of every day for five years, which no person does with a laptop.
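The amortization figures come from dividing the purchase price by lifetime token output. Here is a minimal sketch under the same assumptions as the table ($4,299, ~25 tokens/sec, 365-day years); the function name is illustrative.

```python
HARDWARE_USD = 4299
TOKENS_PER_SEC = 25


def hardware_cost_per_million(hours_per_day: float, years: float,
                              price_usd: float = HARDWARE_USD,
                              tokens_per_sec: float = TOKENS_PER_SEC) -> float:
    """Purchase price spread over every token generated in the period (USD per 1M tokens)."""
    lifetime_tokens = tokens_per_sec * 3600 * hours_per_day * 365 * years
    return price_usd / (lifetime_tokens / 1_000_000)


scenarios = [
    ("Realistic personal, 3 years", 2, 3),
    ("Heavy solo dev, 3 years", 6, 3),
    ("Pinned 24/7, 3 years", 24, 3),
    ("Pinned 24/7, 5 years", 24, 5),
]

for label, hours, years in scenarios:
    print(f"{label:<28} ${hardware_cost_per_million(hours, years):.2f} per 1M tokens")
```

Because cost scales inversely with hours of generation, utilization matters far more than any plausible change in electricity price.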
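Compare that to cloud. A Gemma-class open model through a routed API sits in the $0.10–$0.50 per million tokens range, generates roughly 2–3x faster than the laptop’s ~25 t/s midpoint (cloud providers hit 60–70 t/s vs. the laptop’s 15–40), and costs $0 upfront. By the table above, the local machine doesn’t reach cost parity even when pinned 24/7: ~$1.80/M over three years, or ~$1.10/M over five, is still a multiple of the top of the cloud range before you add electricity. You would have to run it like a datacenter for well beyond five years, and at that point you’ve bought a slow, single-tenant datacenter with a keyboard attached.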
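One way to see how far away parity is: treat the laptop as paying itself off through the gap between the cloud price and the local electricity cost, then ask how long that takes at a given utilization. The sketch below assumes the world-average electricity cost of ~$0.12 per million tokens from the earlier table; the function name and its parameters are illustrative.

```python
from typing import Optional


def break_even_years(hardware_usd: float, cloud_usd_per_m: float,
                     electricity_usd_per_m: float, tokens_per_sec: float,
                     hours_per_day: float = 24.0) -> Optional[float]:
    """Years of generation needed before owning beats the cloud price.

    Returns None when local electricity alone costs at least as much as
    the cloud price, in which case the hardware never pays for itself.
    """
    saving_per_m = cloud_usd_per_m - electricity_usd_per_m
    if saving_per_m <= 0:
        return None
    tokens_needed_m = hardware_usd / saving_per_m
    tokens_per_day_m = tokens_per_sec * 3600 * hours_per_day / 1_000_000
    return tokens_needed_m / tokens_per_day_m / 365


# Top of the quoted cloud range, machine pinned 24/7.
years_at_top = break_even_years(4299, 0.50, 0.12, 25)
# Bottom of the quoted cloud range: electricity already exceeds the cloud price.
years_at_bottom = break_even_years(4299, 0.10, 0.12, 25)

print(f"vs $0.50/M cloud: {years_at_top:.1f} years of 24/7 generation")
print(f"vs $0.10/M cloud: {years_at_bottom}")
```

Under these assumptions, break-even against the most expensive cloud price takes more than a decade of continuous generation, and against the cheapest it never arrives.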
When local makes sense
This isn’t an argument against local inference. It’s an argument for buying it for the right reasons:
- Privacy and regulated data. If prompts contain PII, health records, or code you contractually can’t send off-box, local isn’t a cost decision; it’s a compliance requirement, and the math doesn’t matter.
- Air-gapped or offline work. Planes, ships, field sites, secure facilities. No network means no API, full stop.
- You already own the Mac. This is the big one. If the M5 is a machine you bought to do your job anyway, the hardware is a sunk cost and your marginal price really is ~$0.10–0.20/M in electricity. It’s just slow. For background, latency-insensitive batch work on a machine that would otherwise idle, that’s a great deal.
- Learning and experimentation. No keys, no rate limits, no metered anxiety while you tinker. The freedom to break things cheaply has real value.
When haimaker makes sense
For everything pointed at users or running at scale, a routed cloud setup wins on every axis that isn’t privacy or offline access:
- haimaker.ai — compare and access hundreds of models with unified pricing and benchmarks, and use auto-routing so cheap queries hit cheap models and only hard ones touch frontier tiers. No $4,299 capex, no machine to amortize, 2–3x the throughput, and you pay per token instead of per laptop.
- Direct provider APIs — fine if you only ever need one model and want to manage keys and price changes yourself.
- Local on hardware you own — keep this lane for the privacy, offline, and tinkering cases above.
The honest recommendation for most people is hybrid: run the genuinely sensitive or offline work on the Mac you already have, and route everything else (production traffic, agent loops, bursty variable load, the team’s shared usage) through haimaker so you’re never amortizing idle silicon or waiting on a laptop to finish a token stream.
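The hybrid rule can be written down as a small decision function. This is a sketch of one possible policy, not an API of haimaker or any provider: `Job`, `Route`, `choose_route`, and every field name are hypothetical, and the ordering of the checks is an assumption drawn from the cases listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    CLOUD = "cloud"


@dataclass(frozen=True)
class Job:
    # All fields are hypothetical flags a caller would set for each request.
    contains_sensitive_data: bool = False  # PII, health records, restricted code
    network_available: bool = True
    user_facing: bool = False              # latency matters
    local_machine_idle: bool = False       # owned hardware with spare capacity


def choose_route(job: Job) -> Route:
    """Pick where to run a job under the hybrid policy sketched in the text."""
    # Compliance and connectivity are hard constraints, not cost decisions.
    if job.contains_sensitive_data or not job.network_available:
        return Route.LOCAL
    # User-facing traffic goes to the faster cloud path.
    if job.user_facing:
        return Route.CLOUD
    # Latency-insensitive batch work on an idle, already-owned machine
    # costs only electricity.
    if job.local_machine_idle:
        return Route.LOCAL
    return Route.CLOUD


print(choose_route(Job(contains_sensitive_data=True)))
print(choose_route(Job(user_facing=True, local_machine_idle=True)))
print(choose_route(Job(local_machine_idle=True)))
```

The hard constraints are checked first so that a sensitive or offline job is never sent to the cloud, whatever its latency needs.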
Buy Apple Silicon because you need a great computer and occasionally want private, offline inference. Don’t buy it as a strategy to dodge a cloud bill that, once you do the math, was already smaller than the machine.