On-device RAG: giving a small model a long memory
How to build a retrieval layer that runs entirely on the phone, so a 3B-parameter native AI app can feel like it actually knows you.

A 3B-parameter model has a small head: it can reason fine, but it can't *remember* the conversation you had three weeks ago, or the PDF you skimmed last Tuesday. The fix is the same on a phone as it is in a data center: retrieval-augmented generation. The interesting part is doing it without ever leaving the device.
The stack: a small embedding model (~30M params) running in Core ML / TFLite, a SQLite database with the `sqlite-vss` extension for vector search, and a re-ranker that's just the main LLM scoring the top-k candidates for relevance. Total cold-start cost: about 60 ms on an iPhone 15.
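The shape of that pipeline is simple enough to sketch. Here's a minimal Python version where `embed` and `rerank` are placeholder callables standing in for the on-device embedding model and the LLM-as-re-ranker, and a brute-force in-memory index stands in for the `sqlite-vss` table:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorIndex:
    """Toy stand-in for a sqlite-vss table: stores (doc_id, embedding, text)."""
    def __init__(self):
        self.rows = []

    def add(self, doc_id, embedding, text):
        self.rows.append((doc_id, embedding, text))

    def top_k(self, query_embedding, k=3):
        scored = sorted(self.rows,
                        key=lambda r: cosine(query_embedding, r[1]),
                        reverse=True)
        return scored[:k]

def retrieve(index, embed, rerank, query, k=10, final=3):
    # 1. embed the query with the small embedding model
    q = embed(query)
    # 2. nearest-neighbor search (sqlite-vss on device; brute force here)
    candidates = index.top_k(q, k=k)
    # 3. re-rank the candidates with the main LLM's relevance score
    ranked = sorted(candidates, key=lambda r: rerank(query, r[2]), reverse=True)
    return [r[2] for r in ranked[:final]]
```

The two-stage shape is the point: the cheap embedding model casts a wide net, and the expensive LLM only ever scores `k` short snippets.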
Chunking is the whole game. Bad chunking means your retriever returns plausible-but-irrelevant snippets, and the model confidently builds an answer on top of them. I landed on semantic chunking with a 512-token window and a 64-token overlap, plus a small heading-aware splitter for structured documents.
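The window-plus-overlap scheme and the heading-aware pre-split can be sketched directly. This is an illustrative version operating on pre-tokenized input (`chunk_tokens`) and on markdown-style `#` headings (`split_by_headings`), not the exact splitter described above:

```python
def chunk_tokens(tokens, window=512, overlap=64):
    """Sliding-window chunker: each chunk shares `overlap` tokens with
    the previous one, so no sentence is stranded at a hard boundary."""
    if window <= overlap:
        raise ValueError("window must exceed overlap")
    chunks = []
    step = window - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last chunk already covers the tail
    return chunks

def split_by_headings(text):
    """Heading-aware pre-split for structured documents: start a new
    section at each '#' heading, then chunk each section separately."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```

Splitting on headings first means the token windows never straddle two unrelated sections, which is where most plausible-but-irrelevant retrievals come from.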
Per-app indexes, not one global index. Mixing your Notes, your messages, and a research PDF into the same vector space sounds clever but produces messy retrievals. Keep them in separate indexes and let the orchestrator choose which to query based on the user's intent.
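A minimal routing sketch, assuming hypothetical index names and using keyword matching as a stand-in for the intent classification a real orchestrator would delegate to the LLM:

```python
# Illustrative per-app index names; not a real API.
INDEX_NAMES = ["notes", "messages", "pdfs"]

KEYWORD_ROUTES = {
    "notes":    ("note", "wrote down", "jotted"),
    "messages": ("text", "message", "said", "chat"),
    "pdfs":     ("paper", "pdf", "document", "read"),
}

def route(query: str) -> list[str]:
    """Pick which per-app indexes to search for this query. A real
    orchestrator would classify intent with the LLM; keyword matching
    here is just a placeholder."""
    q = query.lower()
    chosen = [name for name, kws in KEYWORD_ROUTES.items()
              if any(kw in q for kw in kws)]
    return chosen or list(INDEX_NAMES)  # unsure? search everything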
Updates have to be incremental. Re-embedding 10,000 notes every time the user adds one is a battery massacre. Ship a write-ahead log that batches embeddings during charging windows. The user never notices.
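One way to sketch that log is two SQLite tables: a cheap `pending_embeddings` queue written at edit time, and a `vectors` table filled in batches when the device reports it is charging. Table names and the `embed`/`is_charging` callbacks are assumptions for illustration:

```python
import sqlite3

def init_db(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS pending_embeddings "
                 "(doc_id TEXT PRIMARY KEY, text TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS vectors "
                 "(doc_id TEXT PRIMARY KEY, embedding BLOB)")

def enqueue(conn, doc_id, text):
    """Cheap write at edit time: no embedding model is invoked."""
    conn.execute("INSERT OR REPLACE INTO pending_embeddings VALUES (?, ?)",
                 (doc_id, text))

def flush_if_charging(conn, embed, is_charging, batch_size=64):
    """Called opportunistically; embeds queued docs only while charging.
    Returns how many documents were embedded this pass."""
    if not is_charging():
        return 0
    rows = conn.execute("SELECT doc_id, text FROM pending_embeddings LIMIT ?",
                        (batch_size,)).fetchall()
    for doc_id, text in rows:
        conn.execute("INSERT OR REPLACE INTO vectors VALUES (?, ?)",
                     (doc_id, embed(text)))
        conn.execute("DELETE FROM pending_embeddings WHERE doc_id = ?",
                     (doc_id,))
    conn.commit()
    return len(rows)
```

The `INSERT OR REPLACE` on the queue also coalesces repeated edits to the same note, so a document the user touches ten times between charges is embedded once.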
Privacy compounds. Because retrieval is local, the personal context never leaves the device — which means even when the user *does* escalate to a cloud model, you can choose to send the question without the personal grounding. That's a real product lever, not just a privacy posture.
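That lever is a one-line branch in the escalation path. A hypothetical sketch, where `share_context` is whatever the product's opt-in policy decides:

```python
def build_cloud_request(query: str, retrieved_context: list[str],
                        share_context: bool) -> dict:
    """Assemble the payload for a cloud escalation. Personal grounding is
    attached only on opt-in; otherwise only the bare question leaves
    the device."""
    payload = {"question": query}
    if share_context:
        payload["context"] = retrieved_context
    return payload
```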