Context Engineering for AI Agents: Memory Is the Half Nobody Automates

Context engineering replaced prompt engineering as the 2026 buzzword, but most of the effort goes into sliding windows and summarization. The durable memory layer, what persists with provenance and correction across sessions, is the half teams skip, and the half that actually compounds.

memnode · 10 min read
Tags: context engineering, agent memory, memnode, context window, context rot, retrieval, memory architecture

Sometime in late 2025 the industry quietly retired the phrase prompt engineering and replaced it with context engineering. The rebrand was not vanity. Prompt engineering implied a single block of text you tune by hand, a clever incantation that coaxes a better answer out of a model. That mental model never survived contact with a long-running agent. A production agent does not see a prompt. It sees an assembled context window: a system instruction, the conversation history, the results of tool calls, retrieved documents, the contents of a scratchpad, and whatever the framework decided to keep or drop on this particular turn. Context engineering is the discipline of architecting all of that, deliberately, so the model receives the right tokens at the right moment instead of whatever happened to accumulate.

Here is the claim worth arguing about. The single biggest cause of production agent failure is not a weak model. It is bad context. The teams shipping reliable long-running agents are not the ones with the largest context windows or the newest frontier model. They are the ones with the most rigorous context management. And within that discipline there is a split nobody likes to talk about: most context-engineering effort goes into the transport problem, sliding windows and summarization and prompt assembly, while the durable memory layer, what persists with provenance and correction across sessions, is the half teams skip. It is also the half that actually compounds. This piece is about why that gap exists, and what closing it looks like.

What context engineering actually means

Strip the buzzword down and context engineering is the practice of governing everything the model sees at inference time. That is a larger surface than people expect, so it helps to enumerate it. On any given turn the context window is composed of several streams, each of which you can engineer independently:

  • The system layer. Instructions, role, output contract, the tool definitions themselves. Mostly static, but the tool descriptions alone can eat thousands of tokens if you are careless.
  • Conversation history. The running transcript of user and assistant turns. This is the part that grows without bound and is the usual target of compaction.
  • Tool results. Raw output from function calls, API responses, search hits. Often the largest and noisiest stream, and the one most likely to bury the signal.
  • Retrieved context. Documents and chunks pulled in by RAG for this query, ranked by similarity, frequently irrelevant despite the high score.
  • Durable memory. Facts the agent learned in earlier sessions and chose to keep: user preferences, prior decisions, corrections, the state of the world as the agent last understood it.
  • Cache state. What is prefix-stable across turns so the provider can serve it from cache, which is as much a cost and latency lever as a quality one.

Context engineering is the set of decisions about which of these streams to include, in what order, at what fidelity, and when to evict. Five of the six are about this turn. Only one of them, durable memory, is about every turn that will ever come after. That asymmetry is the whole story, and it is why some observers have started framing memory itself as the new paradigm of context engineering rather than a feature bolted onto it.
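Those include, order, and evict decisions can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not any framework's actual assembly logic: the `Stream` type, its `priority` and `position` fields, and the four-characters-per-token estimate are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str
    text: str
    priority: int  # hypothetical: lower value = more important, evicted last
    position: int  # hypothetical: layout order in the window (0 = first)


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: assume about 4 characters per token.
    return max(1, len(text) // 4)


def assemble(streams: list[Stream], budget: int):
    """Admit streams by priority while they fit the budget, then lay them out by position."""
    kept, evicted, used = [], [], 0
    for stream in sorted(streams, key=lambda s: s.priority):
        cost = estimate_tokens(stream.text)
        if used + cost <= budget:
            kept.append(stream)
            used += cost
        else:
            evicted.append(stream.name)
    kept.sort(key=lambda s: s.position)
    return kept, evicted, used


streams = [
    Stream("system", "s" * 400, priority=0, position=0),
    Stream("durable_memory", "m" * 200, priority=1, position=1),
    Stream("history", "h" * 2000, priority=2, position=2),
    Stream("tool_results", "t" * 4000, priority=3, position=3),
]
kept, evicted, used = assemble(streams, budget=700)
print([s.name for s in kept], evicted, used)
```

This sketch drops a stream whole when it does not fit. A real assembler would more likely lower its fidelity first, by truncating or summarizing, before evicting it outright.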

Bigger windows do not fix it

The reflexive answer to context problems is more window. If the agent forgets, give it a million tokens. If retrieval misses, stuff the whole corpus in. This is wrong for a reason that has a name now: context rot. As the window fills, the model's ability to use any single token reliably degrades. It is not a hard cliff at the advertised limit. It is a gradual decline in attention precision, in recall of facts buried in the middle, and in instruction-following, and it starts well before the window is full.

A large context window is a budget, not a memory. Spending all of it is the failure mode, not the goal. The skill is deciding what not to put in.

There is also a brutal economic argument. Tokens in the window are paid for on every single turn. An agent that re-sends a 200,000-token history forty times in a session is not remembering, it is paying rent on the same data over and over, and getting worse answers as the window degrades. The teams that win treat the context window as the most expensive and most fragile resource they own, and they manage it like one. The point is made at length in "stop putting agent memory in the context window": the window is working memory, not storage, and conflating the two is the root error.
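The arithmetic behind the rent metaphor is worth writing out. The sketch below uses the 200,000-token history and forty turns from the example above. The price per million input tokens is a purely hypothetical figure chosen for illustration, and the calculation assumes a fixed-size history billed in full on every turn, with no prefix-cache discount.

```python
def resent_input_tokens(history_tokens: int, turns: int) -> int:
    # Every turn re-sends the full history, so input tokens scale linearly with turns.
    return history_tokens * turns


def session_cost(total_tokens: int, price_per_million: float) -> float:
    # price_per_million is a hypothetical rate, not any provider's actual pricing.
    return total_tokens / 1_000_000 * price_per_million


total = resent_input_tokens(200_000, 40)
print(total)                      # 8000000 input tokens billed in one session
print(session_cost(total, 3.0))   # 24.0 at the assumed rate of 3.0 per million
```

Eight million billed tokens buy the model forty looks at the same 200,000 tokens, which is the sense in which the window is rent rather than storage.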

The hierarchical pattern: hot, warm, cold

The teams who get this right almost always converge on the same shape, borrowed straight from how a CPU manages memory. Context lives in tiers, and the engineering is about moving information between them as its relevance changes:

  1. Hot, kept verbatim. The current task, the last few turns, the active tool results. This is what the model needs at full fidelity right now, and it lives in the window unmodified.
  2. Warm, summarized. Older turns and resolved sub-tasks compressed into a running summary. Lossy by design, present so the agent does not lose the thread of a long session, but not at full token cost.
  3. Cold, retrievable. Everything else, held outside the window in a durable store and pulled in only when a query needs it. This is where the agent's actual long-term knowledge lives, and it is the tier where bigger context windows do nothing for you.
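The three tiers can be sketched as a small data structure. This is an illustrative toy under stated assumptions, not a production design: the `TieredContext` class is hypothetical, the default summarizer merely truncates and concatenates where a real system would call a model, and keyword overlap stands in for real retrieval. Writing every turn to the cold store on arrival is also a choice made for the example.

```python
from collections import deque


class TieredContext:
    def __init__(self, hot_turns: int = 4, summarize=None):
        self.hot = deque()        # hot tier: recent turns kept verbatim
        self.hot_turns = hot_turns
        self.warm_summary = ""    # warm tier: lossy running summary
        self.cold = []            # cold tier: stand-in for a durable store
        # Placeholder summarizer: keep the first 40 characters of each demoted turn.
        self.summarize = summarize or (
            lambda summary, turn: (summary + " " + turn[:40]).strip()
        )

    def add_turn(self, turn: str) -> None:
        self.cold.append(turn)    # assumption: every turn is also stored verbatim
        self.hot.append(turn)
        while len(self.hot) > self.hot_turns:
            oldest = self.hot.popleft()  # demote from hot to warm
            self.warm_summary = self.summarize(self.warm_summary, oldest)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Naive keyword overlap, standing in for a real retrieval step.
        terms = set(query.lower().split())
        scored = [(len(terms & set(t.lower().split())), t) for t in self.cold]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [turn for score, turn in ranked if score > 0][:k]

    def window(self, query: str) -> dict:
        hot = list(self.hot)
        # Skip cold hits that are already present verbatim in the hot tier.
        cold_hits = [t for t in self.retrieve(query) if t not in hot]
        return {"warm": self.warm_summary, "cold": cold_hits, "hot": hot}


ctx = TieredContext(hot_turns=2)
for turn in ["user prefers metric units", "deploy target is staging",
             "ran the test suite", "tests passed"]:
    ctx.add_turn(turn)
print(ctx.window("which units does the user prefer"))
```

In this run the preference about metric units has already left the hot tier, and it returns to the window only because the cold tier still holds it verbatim.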

Most context-engineering tooling is excellent at the hot and warm tiers. Sliding windows, recency weighting, rolling summarization, prompt templating, prefix caching: these are mature, well understood, and increasingly automated by the frameworks. The cold tier is where the discipline thins out, because the cold tier is not a transport problem. It is a memory problem, and memory is harder than transport.

Compaction is not memory

The most common mistake is assuming the warm tier is the memory layer. It is not. Summarizing old turns to fit the window is compaction, and compaction is lossy in a way that is fundamentally unaccountable. The detail your user asks about next week is precisely the one the summary dropped, and once it is gone there is no provenance trail to recover it, no record that it was ever known, no way to ask where a recalled claim came from. We have made this argument before and it holds node for node here: compaction is not memory. A summary is a compression of a transcript. A memory is a durable, addressable fact with a lineage. They are different objects with different lifecycles, and treating one as the other is how agents end up confidently citing things they were corrected on three sessions ago.

The tell is simple. If the only way your agent retains knowledge across sessions is by re-summarizing the transcript and feeding the summary back in, you do not have a memory layer. You have a more elegant sliding window. The moment the relevant turn falls out of the summarization horizon, the knowledge is gone, and your agent's behavior silently changes with no audit trail explaining why.

What a real memory layer adds

A durable memory layer is the part of context engineering that operates outside the window, off the hot path, and persists across every session. It is not a bigger prompt and it is not a transcript archive. It is a store with a specific contract. At memnode we organize that contract around four operations, and they are worth naming because each one closes a gap the transport layer cannot:

  • Record. Write a fact the agent learned, tagged with its source and the task that produced it, not a blob of conversation. The unit of memory is a claim, not a turn.
  • Recall. Retrieve the facts relevant to the current task, ranked by more than raw similarity, so the agent gets the chain of facts a task actually needs rather than the top-k nearest embeddings. Why similarity is the wrong default is its own subject, covered in "why AI memory layers recall the wrong thing".
  • Lineage. Every fact knows where it came from, which task wrote it, from which source, and what it superseded. When the agent recalls something, you can ask why, and get an answer.
  • Correction. When a fact is wrong, you correct it and the correction is itself a recorded, dated event. The store does not silently overwrite, it supersedes, and the old belief and the reason for the change remain inspectable.
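The four operations can be illustrated with a toy in-memory store. To be clear about assumptions: this is not memnode's API. The `Fact` and `MemoryStore` names, every field, and every method signature here are hypothetical, and `recall` uses plain keyword matching only as a placeholder for a real ranking strategy. The sketch is meant to show the shape of the contract, in particular that a correction supersedes rather than overwrites.

```python
import itertools
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    id: int
    claim: str                          # the unit of memory is a claim, not a turn
    source: str                         # where the claim came from
    task: str                           # which task wrote it
    recorded_at: datetime
    supersedes: Optional[int] = None    # id of the fact this one replaced
    superseded_by: Optional[int] = None
    reason: Optional[str] = None        # why the correction was made


class MemoryStore:
    def __init__(self):
        self._facts: dict[int, Fact] = {}
        self._ids = itertools.count(1)

    def record(self, claim, source, task, supersedes=None, reason=None) -> Fact:
        fact = Fact(next(self._ids), claim, source, task,
                    datetime.now(timezone.utc), supersedes, reason=reason)
        self._facts[fact.id] = fact
        return fact

    def correct(self, fact_id, new_claim, source, task, reason) -> Fact:
        # A correction is itself a recorded, dated event; the old fact is kept.
        old = self._facts[fact_id]
        new = self.record(new_claim, source, task, supersedes=old.id, reason=reason)
        old.superseded_by = new.id
        return new

    def recall(self, query: str) -> list[Fact]:
        # Placeholder ranking: keyword match over facts that are still current.
        terms = set(query.lower().split())
        return [f for f in self._facts.values()
                if f.superseded_by is None and terms & set(f.claim.lower().split())]

    def lineage(self, fact_id: int) -> list[Fact]:
        # Walk back from a fact through everything it superseded.
        chain, current = [], self._facts.get(fact_id)
        while current is not None:
            chain.append(current)
            current = (self._facts.get(current.supersedes)
                       if current.supersedes is not None else None)
        return chain


store = MemoryStore()
first = store.record("user timezone is UTC", source="onboarding form", task="setup")
fixed = store.correct(first.id, "user timezone is CET", source="user message",
                      task="scheduling", reason="user relocated")
print([f.claim for f in store.recall("timezone")])
print([(f.claim, f.source) for f in store.lineage(fixed.id)])
```

After the correction, `recall` returns only the current belief, while `lineage` still exposes the superseded claim, its source, and the reason it changed.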

The property that ties these together is that the memory layer is inspectable, not a black box. You can read what your agent believes, see how it came to believe it, and change it deliberately. That is the opposite of an embedding index, where the only debugging tool is to wipe the store and start over. The practical mechanics of building this into an agent, rather than the theory, are walked through in "how to give your AI agent long-term memory".

This is also why the memory half is the half that compounds. The hot and warm tiers reset every session; their value does not accumulate. A durable, corrected, provenance-tracked store gets more valuable the longer the agent runs, because it accretes verified knowledge and prunes contradicted knowledge. An agent with a good memory layer is one that has been right about your domain a thousand times and remembers which thousand. That is not a context-window feature. You cannot buy it with more tokens.

Three questions teams ask

Is context engineering just RAG with a new name? No. RAG is one stream feeding the window, the cold-tier retrieval of external documents. Context engineering governs all six streams and, critically, the durable memory of facts the agent itself learned, which is not a document corpus at all. RAG retrieves what someone else wrote. A memory layer retrieves what the agent decided and was corrected on. Conflating them is how teams end up with a great document index and an agent that still forgets every user preference.

If I use a frontier model with a huge context window, do I still need a memory layer? Yes, and more than before. A bigger window raises the ceiling on the hot tier, but it does nothing for persistence across sessions and it makes context rot more expensive to ignore, because now you can afford to fill it with noise. The agents that scale are the ones that put less in a large window, not more, and keep the durable knowledge in a store built for it.

Can I just keep appending to the system prompt? For a while, until it stops scaling. A growing system prompt has no eviction policy, no provenance, no correction path, and it is paid for on every turn. It has all the durability of a memory layer and none of the management. The moment two facts in it contradict, you have no mechanism to resolve the conflict except editing a wall of text by hand. That is the point where teams reach for a real store.

Engineer the half that compounds

Context engineering is real and it matters, but the industry's attention is lopsided. The sliding window is solved, summarization is solved, prefix caching is solved, and the frameworks automate all three. The durable memory layer, the only tier whose value compounds across sessions, is still treated as an afterthought, usually as a bigger prompt or a vector index with no lineage. That is the half worth engineering, and it is the half that decides whether your agent gets better the longer it runs or just gets more expensive.

memnode is the durable, inspectable memory layer for exactly this. It speaks MCP for local agent tooling and exposes a hosted API for remote clients, and it is built around the loop this whole piece argues for: record what the agent learns with its source, recall the facts a task actually needs, follow the lineage of any belief back to the task and source that produced it, and correct what turns out to be wrong without losing the trail. You can read what your agent knows instead of guessing. If you are doing the context-engineering work and have only engineered the transport half, the memory half is where the compounding lives.