Engineering · March 4, 2026 · 7 min read

Neural Engine vs. GPU vs. CPU: where should mobile inference actually run?

A field guide to picking the right accelerator on iOS and Android — with the gotchas nobody warns you about until your battery graph cliffs.


Modern phones have at least three places to run a model: the CPU (general-purpose, slow, hot), the GPU (parallel, medium-power, also hot), and a dedicated neural accelerator (Apple Neural Engine, Tensor TPU, Qualcomm Hexagon). Picking the wrong one is the difference between a 30-second response and a 3-second one — and between a 4% battery hit and a 22% one.

iOS rule of thumb. Default to `MLComputeUnits.all` and let Core ML place ops. For LLMs specifically, force `cpuAndNeuralEngine` once you've confirmed the model converts cleanly — the GPU path on iOS often loses to the ANE for transformer workloads, and you save the GPU for rendering.
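Translated into code, that rule of thumb is a one-line placement decision at model-load time. Here is a minimal sketch in Python that mirrors the Swift enum cases as strings; the real setting is `MLModelConfiguration.computeUnits` in Swift, and the helper name and arguments below are illustrative, not an Apple API:

```python
def pick_compute_units(workload: str, conversion_verified: bool) -> str:
    """Return the MLComputeUnits case to request, per the rule of thumb above.

    workload and conversion_verified are hypothetical inputs for illustration.
    """
    if workload == "llm" and conversion_verified:
        # Transformer blocks map well to the ANE, and this choice keeps
        # the GPU free for rendering.
        return "cpuAndNeuralEngine"
    # Default: let Core ML's own partitioner place ops across all units.
    return "all"

print(pick_compute_units("llm", conversion_verified=True))     # cpuAndNeuralEngine
print(pick_compute_units("vision", conversion_verified=True))  # all
```

The `conversion_verified` guard matters: forcing `cpuAndNeuralEngine` before confirming a clean conversion just moves the fallback problem around.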

Android is messier. There is no single answer. NNAPI is the portable choice but its op coverage for modern transformer blocks is uneven. The vendor delegates (Qualcomm QNN, MediaTek NeuroPilot, Samsung ENN) are faster where they work and absent where they don't. I ship all three and benchmark on first launch.
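The first-launch benchmark can be as simple as timing each backend that constructs successfully and caching the winner. Here is a sketch of that selection logic with the delegate runners stubbed as placeholder functions; on a real device each would wrap a TFLite interpreter built with that delegate, and the function names and cache format here are invented for illustration:

```python
import json
import time
from pathlib import Path

# Placeholder delegate runners: the sleeps stand in for real inference
# latency on each backend. These are NOT real NNAPI/QNN bindings.
def run_nnapi(x): time.sleep(0.003); return x
def run_qnn(x):   time.sleep(0.001); return x
def run_cpu(x):   time.sleep(0.005); return x

DELEGATES = {"nnapi": run_nnapi, "qnn": run_qnn, "cpu": run_cpu}
CACHE = Path("delegate_choice.json")  # hypothetical cache location

def pick_delegate(warmup: int = 2, iters: int = 5) -> str:
    """Benchmark every delegate that works on this device; cache the winner."""
    if CACHE.exists():
        return json.loads(CACHE.read_text())["delegate"]
    timings = {}
    for name, run in DELEGATES.items():
        try:
            for _ in range(warmup):      # discard cold-start iterations
                run(0)
            t0 = time.perf_counter()
            for _ in range(iters):
                run(0)
            timings[name] = (time.perf_counter() - t0) / iters
        except Exception:
            continue  # delegate absent or broken on this device: skip it
    best = min(timings, key=timings.get)
    CACHE.write_text(json.dumps({"delegate": best, "timings": timings}))
    return best
```

The try/except is doing real work here: a missing vendor delegate should silently drop out of the race, not crash the launch.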

The thermal trap. The ANE is brutally efficient — until the rest of the SoC heats up around it. If the camera or GPS is active, your 'free' neural compute starts costing real watts. Build a power budget that knows what else is running.
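One way to build that budget is a gate that sums the draw of whatever subsystems are active before dispatching to an accelerator. The milliwatt figures and subsystem names below are invented placeholders, and the thermal backoff mirrors the tiers of iOS's `ProcessInfo.thermalState`; treat this as a sketch of the shape, not measured numbers:

```python
from enum import Enum

class Thermal(Enum):
    NOMINAL = 0
    FAIR = 1
    SERIOUS = 2
    CRITICAL = 3

# Hypothetical per-subsystem draw in milliwatts. Real budgets come from
# profiling your own device matrix.
BUDGET_MW = 900
COST_MW = {"camera": 400, "gps": 150, "ane_inference": 350, "gpu_inference": 600}

def can_run_inference(active: set, thermal: Thermal,
                      backend: str = "ane_inference") -> bool:
    """Gate inference on concurrent power draw and SoC thermal state."""
    if thermal in (Thermal.SERIOUS, Thermal.CRITICAL):
        return False  # back off entirely once the SoC is already hot
    spent = sum(COST_MW[s] for s in active)
    return spent + COST_MW[backend] <= BUDGET_MW
```

Note that the same inference call can pass the gate on the ANE and fail it on the GPU: "free" neural compute is only free relative to what else is running.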

The conversion trap. A model that runs beautifully in PyTorch can fall off the ANE entirely if a single op (custom attention mask, exotic activation) isn't supported. The conversion pipeline silently falls back to the CPU, your latency triples, and you have no idea why. Always check the op assignment report after conversion — don't trust the summary.
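Once you have the per-op device assignments (Xcode's Core ML performance report exposes them per layer), auditing for fallbacks takes a few lines. This sketch assumes the report has been flattened into `(op_name, device)` pairs; the format, op names, and device labels are hypothetical:

```python
def audit_op_assignments(report, expected="ane"):
    """Group ops by assigned device and flag anything not on the expected unit."""
    fallbacks = [op for op, device in report if device != expected]
    coverage = 1 - len(fallbacks) / len(report)
    return coverage, fallbacks

# Illustrative report: one unsupported op has silently fallen back to the CPU.
report = [
    ("linear_0", "ane"),
    ("attention_mask", "cpu"),  # the culprit: custom mask op, no ANE kernel
    ("softmax_0", "ane"),
    ("gelu_0", "ane"),
]
coverage, fallbacks = audit_op_assignments(report)
print(f"{coverage:.0%} on ANE; fell back: {fallbacks}")
# 75% on ANE; fell back: ['attention_mask']
```

A single CPU op in the middle of a transformer block is worse than the 25% figure suggests, because every crossing forces a sync and a memory round trip; alert on any fallback, not just low coverage.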