memnode

Why Your AI Memory Layer Recalls the Wrong Thing (mem0, Zep, Letta, and the 64% Ceiling)

The memory-layer category exploded in 2026, but the same complaint follows every product: it remembers the wrong thing. The anatomy of four memory jobs, the LongMemEval recall ceiling (Zep 63.8% vs Mem0 49.0%), the three ways recall fails, and the structural fix.

memnode · 7 min read
agent memory, mem0, zep, letta, recall, benchmarks, vector database

The agent-memory layer became a category in 2026. Mem0 alone passed 41,000 GitHub stars and raised $24 million; Zep, Letta (the project formerly called MemGPT), and Supermemory all have real traction. And yet the most common complaint about all of them is the same sentence, repeated across community threads: it keeps remembering the wrong thing. Here is why.

One layer, four jobs

"Agent memory" is not one workload. It is at least four, and they want different storage:

  • Episodic recall: what was said last, in order. Recency-weighted, often best as a plain session log.
  • Semantic recall over a corpus: "what does our policy say about X." This is the one job vector search was actually built for.
  • Structured state: the user's current preferences, the active task, the last error. Explicit keys, current values, writes overwrite.
  • Cross-session continuity: what we worked on over weeks. Part episodic, part semantic, part structured, and the hardest of the four.

Most memory layers collapse all four into one similarity index. That single decision is the source of nearly every wrong recall, because three of the four jobs are not similarity problems at all. We make the deeper version of this argument in agent memory vs a vector DB.
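
To make the four-way split concrete, here is a minimal routing sketch. The backend labels (`session_log`, `vector_index`, `key_value_store`) are placeholder names assumed for illustration, not the API of Mem0, Zep, Letta, or any other product; the point is only that each job names its own storage instead of sharing one index.

```python
from enum import Enum


class MemoryJob(Enum):
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    STRUCTURED_STATE = "structured_state"
    CROSS_SESSION = "cross_session"


# Hypothetical backend labels; swap in whatever stores you actually run.
BACKENDS = {
    MemoryJob.EPISODIC: ["session_log"],
    MemoryJob.SEMANTIC: ["vector_index"],
    MemoryJob.STRUCTURED_STATE: ["key_value_store"],
    # Cross-session continuity is part episodic, part semantic, part structured.
    MemoryJob.CROSS_SESSION: ["session_log", "vector_index", "key_value_store"],
}


def backends_for(job: MemoryJob) -> list[str]:
    """Return the storage backends a given memory job should read from."""
    return BACKENDS[job]


state_backends = backends_for(MemoryJob.STRUCTURED_STATE)
continuity_backends = backends_for(MemoryJob.CROSS_SESSION)
```

In this sketch a structured-state read never touches the vector index, which is the separation the rest of this article argues for.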

The benchmark ceiling nobody quotes

The numbers are sobering. On LongMemEval with GPT-4o, Zep scores about 63.8% and Mem0 about 49.0%. Read that again: the best widely used layer gets the answer right roughly two times in three, and a popular alternative is right about half the time. Production teams feel exactly that. The marketing says "long-term memory"; the measured reality is a recall layer that is wrong a third to a half of the time (36.2% and 51.0% on these figures). We keep current figures in the agent memory benchmarks writeup.

The three ways recall goes wrong

  1. Stale facts resurface. A value the user changed months ago is still in the index, semantically identical to the current one, and there is no recency or supersession to break the tie. The agent confidently cites the dead fact.
  2. Similar is not relevant. Similarity search returns what is close in embedding space, which is not the same as what answers the current need. "The user's last login was Tuesday" and "the user once asked about login bugs" are neighbors, and only one is useful right now.
  3. Contradictions both win. When a fact is corrected, a naive layer stores both versions as separate memories. Retrieval returns the pair, the model picks, and you have a one-in-two chance of the right answer with no way to enforce a single current value.
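
The first and third failures are easy to reproduce in a toy. The sketch below is an assumption-laden stand-in, not how any named product works: it uses token-overlap (Jaccard) similarity in place of real embeddings, and an append-only list in place of a real index. `NaiveMemory` and the sample facts are hypothetical.

```python
def tokens(text: str) -> set[str]:
    """Lowercase, drop full stops, and split on whitespace."""
    return set(text.lower().replace(".", "").split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a crude stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


class NaiveMemory:
    """Append-only store ranked purely by similarity."""

    def __init__(self) -> None:
        self.items: list[str] = []

    def add(self, text: str) -> None:
        # A correction is stored as just another memory; nothing is retired.
        self.items.append(text)

    def search(self, query: str, k: int = 2) -> list[tuple[float, str]]:
        scored = [(jaccard(query, item), item) for item in self.items]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


memory = NaiveMemory()
memory.add("The user prefers reports in PDF format.")  # old value
memory.add("The user once asked about login bugs.")    # merely similar
memory.add("The user prefers reports in CSV format.")  # the correction

hits = memory.search("Which format does the user prefer for reports")
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

Both the PDF and the CSV memory come back with the same score, so nothing in the ranking says which one is current. Real embeddings are far better than token overlap at judging closeness, but the two versions of a corrected fact remain near-duplicates under any similarity measure, and similarity alone cannot break that tie.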

What actually fixes it

Not a bigger model or a better embedding. The fix is structural: route each memory job to storage that fits it, and give the recall path the three things similarity search lacks, namely recency, supersession, and provenance. Query structured state by key, not by similarity, so a stale value is simply not there. Mark corrections as supersessions so only the current value is retrievable. Carry origin with each memory so the agent can weigh it before acting on it. That provenance is the same machinery that defends against memory poisoning, and the same lineage layer that makes recall auditable.
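
As a rough illustration of those three properties, here is a minimal in-memory sketch. `Fact`, `StateStore`, and their fields are hypothetical names chosen for this example and do not correspond to any particular product's API. Reads go by key, a write supersedes the previous value, and every fact carries its source and timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    source: str                            # provenance: where the fact came from
    recorded_at: datetime                  # recency: when it was written
    superseded_by: Optional["Fact"] = None # supersession: link to the replacement


class StateStore:
    """Keyed state: one current value per key, older values kept for audit."""

    def __init__(self) -> None:
        self._current: dict[str, Fact] = {}
        self._history: list[Fact] = []

    def set(self, key: str, value: str, source: str) -> Fact:
        fact = Fact(key, value, source, datetime.now(timezone.utc))
        previous = self._current.get(key)
        if previous is not None:
            previous.superseded_by = fact  # retire the old value explicitly
        self._current[key] = fact
        self._history.append(fact)
        return fact

    def get(self, key: str) -> Optional[Fact]:
        """Return only the current fact; superseded values are never returned."""
        return self._current.get(key)

    def history(self, key: str) -> list[Fact]:
        """Full lineage for a key, oldest first, for auditing."""
        return [fact for fact in self._history if fact.key == key]


store = StateStore()
store.set("report_format", "PDF", source="onboarding form")
store.set("report_format", "CSV", source="user message")

current = store.get("report_format")
lineage = store.history("report_format")
```

Here `store.get("report_format")` can only return the CSV fact, because the PDF value is no longer reachable through the read path, while `history` still shows both entries and where each came from.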

A two-minute diagnostic

Write down the five questions your agent asks its memory most often. If they look like:

  • What is the user's current preferred format?
  • What was the last action, and did it succeed?
  • What task are we in the middle of?

then you have structured-state questions, and a similarity index will keep getting them subtly wrong. If instead they are "find policy text like this" or "surface tickets resembling this one," vector search is right and you should keep it, for that job only. The teams that escape the wrong-recall complaint are the ones that stopped treating memory as one store. The cleanest way to mix backends per job is the MCP memory server pattern.

Sources