Context Assembly Explained: How AI Builds Smart Prompts From Memory

May 2026 · 10 min read · Fran Olivares, Founder of OlivaresAI

Context assembly is the step where a memory-aware AI builds the system prompt for the next user message: it runs a hybrid keyword + semantic search across the memory store, scores results by a weighted combination of relevance, importance, recency, frequency and confidence, fits the top-ranked entries into a per-tier token budget alongside the active identity blocks, and returns the structured context to the model — all in under 100 ms. Without it, persistent memory is a database; with it, the model behaves as if it remembers because the right slice of memory is in front of it for every turn.

A persistent memory store on its own does nothing. The store has to be queried, scored and shaped into a system prompt that fits the model's context window before the next user message lands. That step — context assembly — is the difference between "we have a memory database" and "the AI remembers". This guide is the long-form companion to the technical reference at /docs/context-assembly and walks through every stage of the pipeline, the numbers Alma uses by default, and the trade-offs you can tune.

Why is context assembly the key step?

Because the model only sees what's in the prompt. A memory store with ten thousand entries is invisible to the model unless something selects the right thirty for this turn. If selection is wrong, the model misses the relevant fact and produces a generic answer. If selection is too broad, the prompt blows past the context window or wastes tokens on noise. Assembly is the gatekeeper — a quiet step the user never sees, but the entire feel of "the AI remembers" sits on its quality.

Assembly also hits a hard latency budget. The user is waiting; anything above ~100 ms starts feeling sluggish before a single model token has streamed. This is why assembly leans on indexed search rather than full scans, why scoring is a weighted sum (not an LLM call), and why the per-tier token budgets are computed up front instead of negotiated dynamically.

How does the assembler retrieve candidates?

Hybrid search across all three memory layers — memories, episodes, procedures — using both keyword and semantic signals. The user's query is embedded with the same model that indexed the store (bge-m3 1024-dim in Alma's default configuration), and the embedding runs against the vector index to surface semantically similar entries. In parallel, a keyword search hits the relational index for exact-term matches that semantic search sometimes misses (proper nouns, code identifiers, rare technical terms).

Both result sets are merged, deduplicated, and capped at the candidate budget (default 100 per layer — the maximum the underlying vector index supports per query). The candidate pool is what flows into scoring; nothing past this stage rescues an entry the search didn't surface.

What signals does Alma use to score memory candidates?

Five signals, weighted as follows in the production scoring function:

The weights are deliberately tuned: relevance dominates, but the secondary signals matter when relevance ties (which happens often in dense memory stores). The weights are inviolable invariants in the codebase — changes require an A/B benchmark because the user-felt quality of "did the AI remember the right thing" depends on this exact mix.

How does the assembler decide what fits?

Each tier (memories, episodes, procedures, Soul blocks) has its own token budget. Defaults: memories ~2 K tokens, episodes ~1 K, procedures ~500, Soul blocks ~500. Total ~4 K — well under any modern model's context window, and small enough to stay cache-friendly. Within each tier, scored entries are added in rank order until the budget is hit.

The budget exists for two reasons. First, the model's effective context shrinks if you cram it past a certain density — relevant things at the bottom of a 100K-token prompt are de facto invisible to the attention pattern. Second, prompt caching only works if the cached prefix is stable; bloating the prompt with low-signal memories busts the cache and makes every turn pay full-price tokens. Tight budgets keep both quality and economics in line.

What does the final assembled prompt look like?

A structured system prompt with five sections (in this order): identity (active Soul blocks rendered as XML), preferences (high-importance memory entries flagged as preferences), relevant facts (top-scored memories for this query), recent context (top-scored episodes), workflows (top-scored procedures). The structure matters: putting identity at the top means it gets full attention; putting workflows at the bottom means they're consulted only if the model decides the query is procedural.

The user message is then appended as the next turn. The model receives the assembled prompt + user message and produces a response. From the user's perspective, the AI just answered. Under the hood, assembly silently consulted thousands of memory records and showed the model the right thirty.

How fast is context assembly in practice?

In Alma's production deployment, end-to-end assembly latency sits in the 30-80 ms range for a typical user (a few hundred memories, a dozen episodes). Vector search dominates (~20-40 ms), keyword search runs in parallel (~5-10 ms), scoring is single-digit ms, and the prompt build is essentially free. The 100 ms target is met with comfortable headroom even for users with thousands of memories — the candidate cap and tier budgets keep work bounded as the store grows.

How does the assembler handle conflicting memories?

Pre-scoring, a contradiction-detection pass over the candidate pool flags pairs in the 0.75-0.92 similarity range that semantically conflict. The newer entry wins by default; the older one is marked superseded and removed from the candidate set for this turn (and globally, on the next consolidation pass). This prevents the model from receiving "you said X" alongside "you said not-X" and improvising a synthesis the user never agreed to.

The full lifecycle (deduplication, supersession, decay) is documented in the complete persistent memory guide; assembly is just where those lifecycle decisions show up at query time.

Is context assembly the same as RAG?

Architecturally similar (both retrieve, both rank, both inject into the prompt) but the corpus and lifecycle are different. RAG retrieves from an external document corpus authored once and re-indexed on a schedule; the entries don't usually evolve. Memory assembly retrieves from the user's own continuously-growing store, with entries that contradict, merge and decay. The scoring weights also differ — RAG ranks mostly by similarity and document authority; memory assembly weights importance, recency and frequency because those signals matter more when the store is personal. See the deeper comparison in Persistent Memory vs RAG.

Can I tune assembly for my workload?

Yes. The POST /api/v1/context/assemble endpoint accepts overrides for the per-tier budgets, the minimum score threshold, the candidate cap and the boost weights for category tags (so a PM agent can boost decisions, a writer's agent can boost voice rules). Most teams stay on defaults — they were tuned to be useful out of the box — but the levers exist for specialised verticals.

How do I see context assembly in action?

Get started at alma.olivares.ai, populate twenty or thirty memories about a project you care about, then start a chat. The model's first response in the new conversation will reference specific facts from your memory store — that's assembly, just hidden behind the user-facing chat. For developers integrating directly: the REST API exposes the raw assembled prompt so you can inspect exactly what was selected for each query.

Related reading: Context Assembly technical reference · Three-Layer Memory Architecture · Persistent Memory for AI: Complete 2026 Guide · Persistent Memory vs RAG · Soul Engine Explained.

See plans