Agent Memory Benchmarks 2026: The Real Numbers (LongMemEval, LOCOMO, and the Gaps)
OMEGA at 95.4% LongMemEval. Mastra at 94.87%. Mem0 at 66.9% LOCOMO. The scores are real but the benchmarks measure different tasks. Here is what each one actually tests, what they miss (cost, lineage, poisoning), and how to pick a memory system in May 2026.
Agent memory benchmark scores started getting published in waves through 2026. The numbers are dramatic. OMEGA claims 95.4% on LongMemEval. Mastra's Observational Memory says 94.87% with a gpt-5-mini actor. Mem0 sits at 66.9% on LOCOMO. Zep/Graphiti at 71.2% on LongMemEval.
Useful context for picking a memory system, partly. Misleading if you take them at face value, mostly. Here's what the benchmarks actually measure, what they don't catch, and how to read the numbers without ending up with the wrong system.
The published numbers, side by side
Benchmark System Score Model
─────────────────────────────────────────────────────────────────────
LongMemEval OMEGA 95.4% GPT-4.1
LongMemEval Mastra Observational Memory 94.87% gpt-5-mini
LongMemEval Emergence AI (RAG-based) 86% -
LongMemEval Zep / Graphiti 71.2% gpt-4o
LongMemEval Mem0, Letta, others not published
LOCOMO Mem0 66.9% -
LOCOMO OpenAI Memory 52.9% -
LOCOMO others not publishedWhat LongMemEval actually tests
LongMemEval (ICLR 2025) is a recall benchmark. About 40 multi-turn conversations, each one thousands of tokens long, with questions about specific facts buried inside them. Score = percentage of questions answered correctly using whatever memory the system has stored.
It is a clean test. It is also a small one. The benchmark doesn't capture:
- Behavior at hundreds or thousands of sessions (the regime real production agents hit)
- Memory drift when stored facts contradict each other later
- Memory poisoning resistance (prompt injection through a memory entry)
- Per-tenant isolation under shared backend
- Cost per recall at production volume
A 95% score on LongMemEval means a system is competent at "recall a fact from a recent conversation." It does not mean the system is operationally ready for a multi-tenant production deployment.
What LOCOMO actually tests
LOCOMO is a different shape. Longer conversations, more inference-heavy questions ("what did the user mean when they said X three sessions ago, given that Y is also true?"). The scores are lower across the board because the task is harder. Mem0's 66.9% vs OpenAI Memory's 52.9% is a real-feeling gap (~25% relative improvement), but neither score crosses the threshold where you'd trust the system unsupervised on hard cases.
The benchmark vs benchmark vs benchmark trap
OMEGA's 95.4% LongMemEval and Mem0's 66.9% LOCOMO are not directly comparable. Different benchmarks, different task structures, different difficulty. If you read marketing copy that compares them as if they were equivalent, that's the marketing copy's problem, not yours.
The honest read: if you care about LongMemEval-style recall, OMEGA and Mastra both look strong. If you care about LOCOMO-style inference, Mem0 leads what's been published, but it has a smaller delta from the next-best system than the LongMemEval leaders do from theirs. If you care about something a benchmark hasn't measured yet, you're flying blind regardless of the score.
What's not benchmarked but matters more
Production cost per recall
None of the published numbers include "and the production cost was $X per 1,000 recalls." Mem0 routes through a cloud API. Zep routes through a cloud API. OMEGA's deployment shape isn't clear from public materials. Cloud-routed memory adds latency and cost that the benchmark doesn't penalize.
At memnode, this is the trade-off we picked the other side of. Local-first deployment, no per-recall API cost, the latency is sub-millisecond on disk. The benchmark cost is "we don't have a LongMemEval score yet because the benchmark is set up for cloud APIs." The production cost is "you can run a million recalls a day for the price of a small SSD."
Lineage and correction
When the agent recalls something wrong, what do you do about it? Most memory systems treat memories as opaque blobs. You can delete the wrong entry, but you can't see why the agent remembered it (which source, which session, which inference chain). You can't correct the underlying fact and have all dependent memories update.
This isn't a benchmark category yet. It probably should be. The systems that ship lineage-first (memnode, some research systems) trade off raw recall accuracy for inspectability. The systems that ship recall-first trade off the ability to debug what they got wrong.
Memory poisoning
The OWASP Agentic Top 10 (ASI02) calls out context injection through stored memory as a top risk. None of the published memory benchmarks include adversarial test cases. A system that scores 95% on recall and 0% on adversarial robustness is shipping a vulnerability. Until benchmarks measure both, you have to evaluate this yourself.
How to actually pick a memory system in May 2026
Forget the scoreboard. Three questions matter more:
- Is your data ok in someone else's cloud? If yes, the hosted options (Mem0, Zep, OMEGA cloud) are on the table. If no, you need a local-first option (memnode, self-hosted Zep, your own).
- What's your recall pattern? Many short sessions favor "session brief" + dedup, which graph-backed systems handle well. Many long sessions favor strong long-context retrieval, which most systems support. Hundreds of repeated sessions favor systems with explicit memory consolidation rather than naive append-only stores.
- How often will you need to correct what the agent remembered? If often, lineage matters more than raw recall. If rarely, you can buy the highest-benchmark system and treat memory as a black box.
What memnode optimizes for
We didn't publish a LongMemEval score because we'd have to plumb the benchmark harness through our local-first storage layer, and the benchmark's design assumes a cloud API. We're working on it. In the meantime, the things memnode is actually built for:
- Local storage with optional hosted relay (you keep the data)
- Lineage as a primitive (every memory entry knows where it came from)
- Namespace scoping (multiple repos / projects / personas, no cross-talk)
- Inspectable corrections (fix the source, dependent memories update)
For the agents we built memnode against, those four things matter more than a percentage point on a benchmark designed for a different deployment shape. Your priorities may differ. The point is to choose against your priorities, not against the press release.
The next benchmark we'd actually trust
A benchmark that runs 1,000 sessions, includes adversarial poisoning attempts, measures cost per recall, and reports per-session drift, would tell us something the current scoreboards don't. Until then, the published numbers are a useful starting filter (rule out systems that score below 50% on their own preferred benchmark) and a poor final answer.