Shipping an on-device LLM without melting the phone
What I learned moving inference from a cloud endpoint to a 3B-parameter LLM running on the Apple Neural Engine — latency, thermals, memory, and the tricks that actually worked.

When I started this project, the obvious answer was to call a hosted API. It works, it's fast to ship, and it scales with someone else's GPU bill. But the whole point of a *native* AI app is that it should feel like part of the operating system — instant, private, and available offline.
So I spent the last six weeks porting the core conversational loop to an on-device 3B-parameter model, quantized to 4-bit, running through Core ML on iOS and TFLite on Android. Here's what surprised me.
**Latency is a UX problem, not a benchmark.** First-token latency under 400ms is the threshold where the interface stops feeling like a request and starts feeling like a thought. Anything above that and users tap away.
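Measuring this means timing the first decoded token, not the full response. A minimal sketch, where `generate_tokens` stands in for a real streaming inference loop (the name and shape are assumptions, not the app's actual API):

```python
import time

def time_to_first_token(generate_tokens):
    """Measure time-to-first-token (TTFT) for a streaming token source.
    `generate_tokens` is a hypothetical callable returning an iterator
    of decoded tokens."""
    start = time.monotonic()
    stream = generate_tokens()
    first = next(stream)  # blocks until the first token arrives
    ttft_ms = (time.monotonic() - start) * 1000.0
    return first, ttft_ms

# Usage with a stand-in generator that yields instantly:
tok, ttft = time_to_first_token(lambda: iter(["Hello", " world"]))
assert ttft < 400.0  # inside the 400 ms UX budget
```

The point of isolating TTFT is that total generation time can be hidden behind streaming, but the first token cannot.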
**Thermals are the real constraint.** A pristine benchmark run on a cool device tells you almost nothing. The interesting question is: what does the second minute of conversation look like? I added a thermal-aware throttler that drops to a smaller draft model when the device crosses into the `.serious` thermal state.
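The throttler's core decision is just a comparison against an ordered thermal scale. A minimal sketch, assuming states ordered like iOS's `ProcessInfo.thermalState` (nominal through critical); the model names are hypothetical placeholders:

```python
from enum import IntEnum

class ThermalState(IntEnum):
    # Ordered like iOS ProcessInfo.thermalState: nominal..critical.
    NOMINAL = 0
    FAIR = 1
    SERIOUS = 2
    CRITICAL = 3

def pick_model(state: ThermalState) -> str:
    """Drop to a smaller draft model once the device runs hot."""
    if state >= ThermalState.SERIOUS:
        return "draft-500m"  # hypothetical small draft model
    return "main-3b"         # the full 3B quantized model

assert pick_model(ThermalState.FAIR) == "main-3b"
assert pick_model(ThermalState.SERIOUS) == "draft-500m"
```

The real version also hysteresis-gates the switch back up, so the app doesn't flap between models as the device cools.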
**Memory budget is brutal.** iOS will jettison your app at around 1.4GB on most phones. A 3B model in 4-bit weighs ~1.8GB on disk and ~2.1GB resident. The fix was a streaming KV-cache and aggressive page-out of layers we weren't actively using.
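The actual streaming KV-cache is more involved than this, but the core idea is a sliding window: keep only the most recent attention keys and values so resident memory is bounded no matter how long the conversation runs. A minimal sketch:

```python
from collections import deque

class StreamingKVCache:
    """Sliding-window KV cache: retain only the last `window` decode
    steps so memory stays bounded regardless of conversation length."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        # Oldest entries fall off automatically once the window is full.
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = StreamingKVCache(window=4)
for step in range(10):
    cache.append(f"k{step}", f"v{step}")
assert len(cache) == 4                              # bounded, not 10
assert list(cache.keys) == ["k6", "k7", "k8", "k9"]  # most recent survive
```

Production variants usually pin a few "sink" tokens at the start of the window so attention doesn't degrade, but the memory bound comes from the same eviction principle.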
**Quantization is a knob, not a switch.** 4-bit weights with 8-bit activations was the sweet spot. INT4 across the board lost too much on multi-turn coherence; FP16 was untenable thermally. Per-channel scales bought back about half the quality gap for free.
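Per-channel scaling is worth seeing in miniature: instead of one scale for the whole weight matrix, each output channel gets its own, so small-magnitude channels aren't crushed to zero by one large outlier elsewhere. A pure-Python sketch of symmetric INT4 quantization (no tensor library, and not the toolchain's actual implementation):

```python
def quantize_int4_per_channel(weights):
    """Symmetric 4-bit quantization with one scale per output channel.
    `weights` is a list of rows, one row per channel."""
    qweights, scales = [], []
    for row in weights:
        # Map the channel's max magnitude onto the int4 range (use +/-7).
        scale = max(abs(w) for w in row) / 7.0 or 1.0
        q = [max(-8, min(7, round(w / scale))) for w in row]
        qweights.append(q)
        scales.append(scale)
    return qweights, scales

def dequantize(qweights, scales):
    return [[q * s for q in row] for row, s in zip(qweights, scales)]

w = [[0.5, -1.0, 0.25],        # large-magnitude channel
     [0.01, -0.02, 0.015]]     # small-magnitude channel
qw, sc = quantize_int4_per_channel(w)
# With a single global scale, the second row would quantize to zeros;
# per-channel scales preserve its structure.
assert all(q != 0 for q in qw[1])
```

The asymmetry in the findings follows from this: activations have per-token outliers that 4 bits can't absorb, which is why W4A8 beat uniform INT4.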
**The boring wins are the biggest wins.** Pre-warming the model context on app launch, keeping the tokenizer hot in memory, and prefetching the first few decoder layers shaved another 180ms off perceived latency — more than any clever kernel trick.
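The pre-warm sequence is ordinary background work kicked off at launch. A minimal sketch, where `load_tokenizer` and `load_decoder_layer` are hypothetical loader callables standing in for the real Core ML / TFLite loading paths:

```python
import threading

def prewarm(load_tokenizer, load_decoder_layer, n_prefetch=4):
    """At app launch, load the tokenizer and prefetch the first few
    decoder layers on a background thread so the first real request
    doesn't pay the cold-start cost."""
    def work():
        load_tokenizer()              # keep the tokenizer hot in memory
        for i in range(n_prefetch):   # page in the leading decoder layers
            load_decoder_layer(i)
    t = threading.Thread(target=work, daemon=True)
    t.start()
    return t  # caller can join, or just let it race the first request

# Usage with recording stubs:
loaded = []
t = prewarm(lambda: loaded.append("tok"),
            lambda i: loaded.append(f"layer{i}"), n_prefetch=2)
t.join()
assert loaded == ["tok", "layer0", "layer1"]
```

On device this runs behind the launch screen, so by the time the user finishes typing their first message, the cold-start work is already paid for.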
Next post: how I'm using a tiny on-device retriever to make the model feel like it actually knows you. For a broader survey of native AI app development techniques across iOS and Android, the team at Native App AI keeps a running index that's worth bookmarking.