Stop Putting Agent Memory in the Context Window (2026)
The context window is a bounded, volatile, flat buffer, not a memory. Why agent memory belongs in a durable external layer, the four kinds it must hold (working, semantic, procedural, episodic), and why an inspectable record-recall-lineage-correction loop beats a black-box retriever.
There is a recurring thread on the agent-building forums that keeps getting upvoted because it keeps being true. The most direct version, posted to r/AI_Agents this month, is the title of this piece: stop putting your AI agent's memory inside the LLM context window. It is a deceptively simple instruction, and almost everyone building an agent gets it wrong the first time, because the context window is right there and stuffing things into it is the path of least resistance. You append the last conversation, you paste in the user profile, you glue on a few retrieved snippets, and for a while the agent seems to remember. Then the session ends, or the window fills, or the user comes back on a different machine, and the memory evaporates. The agent greets a returning user as a stranger. This is not a prompt-engineering problem. It is an architecture problem, and the fix is to treat memory as a durable layer that lives outside the model, with its own lifecycle, instead of as a transient buffer the model happens to be holding right now.
This post is about why the context window is the wrong place for memory, what a durable external memory layer actually has to hold, and why the four kinds of memory that an agent needs are not interchangeable. The framing of those four kinds is borrowed from a short, surprisingly popular explainer that IBM put out, Martin Keen's The Four Types of Memory Every AI Agent Needs, which has pulled tens of thousands of views precisely because it names a structure that most agent tutorials skip. Working, semantic, procedural, episodic. Each one has a different shape, a different update rule, and a different reason it cannot just be a window of recent tokens.
The context window is a buffer, not a memory
The confusion starts with a category error. The context window feels like memory because the model can refer to things in it, but it is a working buffer with three properties that disqualify it as the place where memory lives. It is bounded: every model has a hard token limit, and the more you fill it the more you pay per call and the slower each call gets. It is volatile: when the request finishes, the window is gone, and the next request starts from whatever you decided to reconstruct. And it is flat: everything in it has the same status. A fact the user stated six weeks ago and an offhand remark from thirty seconds ago sit side by side as undifferentiated text, with no notion of which one is current, which one is trusted, and which one has since been corrected.
Those three properties interact badly. Because the window is bounded, you eventually have to throw things out, and the usual move is compaction: summarize the old turns into a shorter blob and keep going. Compaction keeps the conversation flowing, but it is lossy by construction, and it is not the same operation as remembering. We have written about that confusion at length in compaction is not memory; the short version is that summarizing a transcript so the model can continue is a context-management trick, not a durable record you can query, correct, or cite next week. The transcript that produced the summary is usually gone, so you cannot go back and check what was actually said.
Memory you cannot inspect, cannot correct, and cannot recall from a different process is not memory. It is a buffer with a good marketing department.
The cleanest way to see the gap is to watch an agent across sessions and across machines. One builder described the modern setup as a few Claude Code sessions, a few Codex sessions, agents with different permissions and memory, and sessions stuck on one machine. That last phrase is the tell. If your agent's memory is in its context window, it is stuck on the machine and in the process that holds that window. Start the same agent on your laptop in the morning and on a server overnight and they share nothing. The whole point of memory is continuity across exactly those boundaries, and a buffer cannot give you that.
Memory is a layer, and it lives outside the model
The alternative is to make memory an explicit layer the agent talks to, with operations of its own, and to keep the context window for what it is good at: holding the small, relevant slice of memory that this particular turn needs. The model does not own the memory. The model borrows from it, one query at a time. This is the consensus the forum threads keep converging on, and it is the same split that shows up in every serious framework: a short-term scratchpad that is thread-scoped and ephemeral, and a long-term store that outlives any single thread. We pulled that distinction apart in detail in the LangGraph checkpointer versus store piece, because LangGraph happens to give the two things two different names and a lot of people wire up the thread-scoped one and assume they have memory.
A durable external layer has to do four things the context window cannot. It has to persist across sessions, processes, and machines, so the laptop agent and the overnight server agent read the same store. It has to be queryable, so the agent can ask for exactly the slice it needs rather than dragging the whole history into the prompt. It has to be writable with a lifecycle, so a fact can be added, superseded, or retired without leaving a contradictory twin behind. And it has to be inspectable, so when the agent says something you can ask where that came from and get an answer that is not a shrug. Hold onto that last property. It is the one most retrieval stacks quietly drop, and it is the one this whole argument turns on.
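As a minimal sketch of those four properties, assuming a single SQLite file stands in for the durable layer: the table layout and the names `open_store`, `write`, `query`, and `history` are hypothetical, chosen for illustration, and are not any product's API. Persistence comes from the file on disk, queryability from asking for one topic, lifecycle from marking the old row historical instead of deleting it, and inspectability from the stored `source` column.

```python
import sqlite3
import time

# Hypothetical schema: one row per remembered value, never deleted.
SCHEMA = """
CREATE TABLE IF NOT EXISTS memories (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    kind TEXT NOT NULL,
    topic TEXT NOT NULL,
    value TEXT NOT NULL,
    source TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'current',
    created_at REAL NOT NULL
)
"""


def open_store(path):
    """Open (or create) the store; any process given the same path sees the same rows."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def write(conn, kind, topic, value, source):
    """Add a value and mark any previous current value for the topic as historical."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "UPDATE memories SET status = 'historical' "
            "WHERE kind = ? AND topic = ? AND status = 'current'",
            (kind, topic),
        )
        cur = conn.execute(
            "INSERT INTO memories (kind, topic, value, source, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (kind, topic, value, source, time.time()),
        )
    return cur.lastrowid


def query(conn, kind, topic):
    """Return (id, value, source) for the current value, or None if there is none."""
    return conn.execute(
        "SELECT id, value, source FROM memories "
        "WHERE kind = ? AND topic = ? AND status = 'current'",
        (kind, topic),
    ).fetchone()


def history(conn, kind, topic):
    """Return every value ever written for the topic, oldest first, with its status."""
    return conn.execute(
        "SELECT id, value, status, source FROM memories "
        "WHERE kind = ? AND topic = ? ORDER BY id",
        (kind, topic),
    ).fetchall()
```

A laptop session and a server session that open the same path would read the same current value, and `history` would still show what the value used to be and where each version came from. A real deployment would more plausibly put this behind a network service than a shared file, but the four obligations stay the same.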
The four kinds of memory, and why they are not one thing
The reason you cannot solve this with a single bucket is that the word memory covers at least four distinct workloads, and they have genuinely different shapes. The IBM framing names them working, semantic, procedural, and episodic. They differ in how long they live, how often they change, how they are written, and how they are read. Treat them as one and you will optimize for the average of four incompatible workloads, which is to say you will be mediocre at all of them.
Working memory
Working memory is the current task: the goal in flight, the intermediate results, the few recent turns, the variables the agent is juggling right now. This is the one workload that genuinely belongs in or near the context window, because it is small, hot, and short-lived. The mistake is not putting working memory in the window. The mistake is putting everything else there too, and then treating the whole pile as if it were durable. Working memory should be aggressively pruned and allowed to expire; when the task ends, most of it should be discarded, and the small residue worth keeping should be promoted into one of the durable kinds below.
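The prune-expire-promote rule can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed design: the `WorkingMemory` class, the time-to-live policy, and the `keep` flag are all hypothetical names introduced here.

```python
import time
from dataclasses import dataclass


@dataclass
class WorkingItem:
    value: str
    expires_at: float
    keep: bool = False  # flagged as worth promoting when the task ends


class WorkingMemory:
    """Small, short-lived scratch state for one task (hypothetical sketch)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self.items = {}

    def put(self, name, value, keep=False):
        self.items[name] = WorkingItem(value, self.clock() + self.ttl, keep)

    def get(self, name):
        item = self.items.get(name)
        if item is None:
            return None
        if item.expires_at <= self.clock():
            del self.items[name]  # expired entries are dropped on access
            return None
        return item.value

    def end_task(self):
        """Discard everything and return only the unexpired, flagged residue."""
        now = self.clock()
        residue = {
            name: item.value
            for name, item in self.items.items()
            if item.keep and item.expires_at > now
        }
        self.items.clear()
        return residue
```

Whatever `end_task` returns is the residue the caller would write into the durable layer as semantic, procedural, or episodic memory; everything else is gone by design.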
Semantic memory
Semantic memory is distilled fact: the user prefers metric units, the production database is in the Frankfurt region, the company's refund window is fourteen days. These are not events; they are standing truths that the agent should be able to recall by meaning at any time, in any session. Semantic memory changes slowly, and when it changes it usually supersedes rather than appends: the new refund window replaces the old one. This is the workload people reach for a vector database to serve, and it is the one place vector similarity is genuinely the right primitive, as long as the facts carry a current-versus-historical status so retrieval does not hand the agent a value that was true last year. The two-layer model we describe in episodic and semantic memory is exactly about keeping this layer clean.
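To make the status requirement concrete, here is a toy recall function that filters on status before ranking by cosine similarity. The three-dimensional vectors are hand-written stand-ins for real embeddings, and `Fact` and `recall` are hypothetical names; a real system would get its vectors from an embedding model and its ranking from a vector index.

```python
import math
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    vector: tuple  # toy embedding; assumed to come from an embedding model
    status: str    # "current" or "historical"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall(facts, query_vector, k=1):
    """Rank only current facts by similarity, so superseded values cannot win."""
    current = [f for f in facts if f.status == "current"]
    ranked = sorted(current, key=lambda f: cosine(f.vector, query_vector), reverse=True)
    return ranked[:k]
```

The design point is the order of operations: if the superseded refund window happens to sit closer to the query than its replacement, similarity alone returns last year's value, and the status filter is what prevents that.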
Procedural memory
Procedural memory is how to do things: the multi-step recipe for deploying a release, the sequence the agent learned for resolving a class of support ticket, the tool-use pattern that worked last time. This is the kind that turns a capable-but-amnesiac agent into one that improves with experience, because a procedure that succeeded can be recalled and rerun instead of rediscovered. Procedural memory is written rarely and read often, and it benefits enormously from provenance: when a procedure fails, you want to know which run taught the agent that procedure, so you can retire the lesson rather than letting a bad habit calcify. Letta's whole design, descended from the MemGPT paper, leans on the agent editing its own procedural and semantic memory through tools rather than through a retrieval pipeline, which is one coherent answer to where procedural memory should live.
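A small sketch of what provenance buys you here, assuming each procedure records the run that taught it. `ProcedureStore`, `learned_from_run`, and the run identifiers are hypothetical and for illustration only; this is not how Letta or any other framework stores procedures.

```python
from dataclasses import dataclass


@dataclass
class Procedure:
    name: str
    steps: list
    learned_from_run: str  # provenance: which run taught this procedure
    retired: bool = False


class ProcedureStore:
    """Written rarely, read often; retirement is keyed on provenance."""

    def __init__(self):
        self._procedures = []

    def learn(self, name, steps, run_id):
        self._procedures.append(Procedure(name, list(steps), run_id))

    def recall(self, name):
        """Return the most recently learned, non-retired procedure for this name."""
        for proc in reversed(self._procedures):
            if proc.name == name and not proc.retired:
                return proc
        return None

    def retire_lessons_from(self, run_id):
        """Retire every procedure taught by a given run; return how many were retired."""
        count = 0
        for proc in self._procedures:
            if proc.learned_from_run == run_id and not proc.retired:
                proc.retired = True
                count += 1
        return count
```

When a procedure fails, the caller can look at `learned_from_run` on the recalled procedure, retire everything that run taught, and fall back to the previous known-good version rather than rediscovering the recipe from scratch.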
Episodic memory
Episodic memory is what happened: the log of specific interactions, in order, with their timestamps and their context. The user asked about migrations on Tuesday, the agent recommended a plan, the plan partly failed on Thursday. Episodic memory is the rawest and the highest-volume of the four, and it is the one the naive embed-every-turn approach abuses most badly, because most episodes are not worth keeping verbatim forever. The right move is to keep episodes as an append-only record with recency weighting, and to consolidate them over time: the durable facts get promoted into semantic memory, the useful patterns get promoted into procedural memory, and the bulk of the raw log decays. That consolidation pass is the slow clock of a memory system, the subject of offline consolidation.
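One way to sketch recency weighting and a consolidation pass is with an exponential half-life. That decay curve is an assumption made for this example; the argument above only calls for recency weighting and does not prescribe a formula. `Episode`, `recency_weight`, and `consolidate` are hypothetical names, and the optional `fact` field stands in for whatever extraction step decides that an episode contains something durable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    timestamp: float            # seconds since some epoch
    text: str                   # what happened
    fact: Optional[str] = None  # durable fact extracted from the episode, if any


def recency_weight(age_seconds, half_life_seconds):
    """Weight halves every half-life: 1.0 for a brand-new episode, 0.5 after one half-life."""
    return 0.5 ** (age_seconds / half_life_seconds)


def consolidate(episodes, now, half_life_seconds, keep_above):
    """Promote extracted facts, then keep only episodes whose weight is still high enough.

    Returns (promoted_facts, kept_episodes); the input list is not modified.
    """
    promoted = [e.fact for e in episodes if e.fact is not None]
    kept = [
        e for e in episodes
        if recency_weight(now - e.timestamp, half_life_seconds) >= keep_above
    ]
    return promoted, kept
```

Run on a slow schedule, a pass like this moves the durable content into semantic memory before the raw episode that carried it decays out of the log.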
Lay the four side by side and the design pressure is obvious. Working memory wants to be small and to expire. Semantic memory wants to be deduplicated and current. Procedural memory wants provenance. Episodic memory wants order and consolidation. No single flat store, and certainly no single context window, serves all four. The durable layer's job is to hold all four with their different rules and to hand the context window the right slice of the right kind at the right moment.
Inspectable beats magical: record, recall, lineage, correction
Most external memory products stop at two operations: write a memory, retrieve the top matches. That is better than a context window, but it quietly recreates the flat-store problem one level out. If all you can do is write and retrieve, you still cannot answer the two questions that separate a memory you can trust from a black box: "Why did you recall this?" and "That fact is wrong; how do I fix it?" A retrieval index that only does similarity search will happily hand back a confident, plausible, out-of-date answer and give you no way to trace it or repair it. We took apart that exact failure in why memory layers recall the wrong thing, where the uncomfortable finding is that even good managed layers recall the right memory only about two times in three.
The loop memnode organizes itself around is four operations rather than two: record, recall, lineage, correction. Record and recall are the obvious pair. Lineage is the part most stacks drop: every recalled answer can be traced back to the specific memories that produced it, so show-me-the-evidence is always answerable rather than a matter of trust. Correction is the part that makes the store improve instead of rot: when a fact is wrong, you correct it in place, the correction is itself recorded with its own provenance, and the bad value stops being recalled without being silently erased from history. This is design-level positioning, not a claim about internals the product does not publish, but the shape of the loop is the point: a durable layer you can inspect and correct, not a magical retriever you have to take on faith. The longer story of how that loop was arrived at is in how memnode evolved from a graph database to a memory reasoning engine.
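The shape of that loop can be sketched in plain Python. To be explicit about the assumptions: this is not memnode's API or implementation, which the text above says is not published. `MemoryLoop`, `Memory`, `Answer`, and every method name are hypothetical, and recall is reduced to an exact topic match so that the lineage and correction mechanics stay visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memory:
    id: int
    topic: str
    value: str
    source: str                          # provenance of this record
    superseded_by: Optional[int] = None  # set when a correction replaces it


@dataclass
class Answer:
    value: str
    evidence: list  # ids of the memories this answer was built from


class MemoryLoop:
    """Illustrative record / recall / lineage / correction loop."""

    def __init__(self):
        self._log = []  # append-only: corrections add rows, nothing is erased

    def record(self, topic, value, source):
        memory = Memory(len(self._log) + 1, topic, value, source)
        self._log.append(memory)
        return memory.id

    def recall(self, topic):
        """Answer from the newest memory on the topic that has not been superseded."""
        live = [m for m in self._log if m.topic == topic and m.superseded_by is None]
        if not live:
            return None
        newest = live[-1]
        return Answer(newest.value, [newest.id])

    def lineage(self, answer):
        """Return the memories that produced an answer."""
        return [m for m in self._log if m.id in answer.evidence]

    def correct(self, memory_id, new_value, source):
        """Record the corrected value with its own source and supersede the old one."""
        old = self._log[memory_id - 1]
        new_id = self.record(old.topic, new_value, source)
        old.superseded_by = new_id
        return new_id

    def history(self, topic):
        """Every memory ever recorded for the topic, including superseded ones."""
        return [m for m in self._log if m.topic == topic]
```

After a correction, `recall` returns the new value, `lineage` points at the correcting record and its source, and `history` still shows the superseded value, which is the difference between correcting a memory and silently overwriting it.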
It does not store facts. It maintains beliefs, with evidence. It does not retrieve records. It reconstructs answers, with provenance.
What to do on Monday
If you are about to build an agent, or you are staring at one whose memory keeps evaporating, the practical moves follow directly from the four-workload split:
- Keep the context window for working memory only. The current goal, the live intermediate state, the last few turns. Prune it hard and let it expire when the task ends.
- Move semantic, procedural, and episodic memory into a durable layer outside the model. One store the agent queries, that persists across sessions, processes, and machines, so the same agent on a laptop and on a server reads the same memory.
- Give every durable fact a status and a lifecycle. Current versus historical, so a correction supersedes instead of stacking a contradiction next to the old value.
- Insist on lineage and correction, not just write and retrieve. If you cannot ask why a memory was recalled or fix one that is wrong, you have bought a faster way to be confidently wrong.
- Consolidate episodes on a slow clock. Promote the durable facts and patterns, decay the raw log, so recall quality holds as the store grows.
The forum consensus is right, but it is only half an instruction. Stop putting memory in the context window, yes. The other half is what to put it in: a durable, queryable, inspectable layer that knows the difference between working, semantic, procedural, and episodic memory, and that can tell you where any answer came from. If you want the honest landscape of the tools competing to be that layer, and where a durable inspectable one fits among the frameworks, read the companion piece on agent memory frameworks in 2026.