Four Hermes-Inspired Memory Features, Synthesized From a Graph

NousResearch Hermes describes four useful agent-memory primitives (session brief, procedural outcomes, end-of-session consolidation, FTS recall) and stores them in flat files. We re-implemented them against memnode's knowledge graph. The change is small in code and large in operational consequence: less plumbing per feature, dedup and lineage become free, and EMA-graded procedures self-correct without a separate analytics pipeline.

memnode • May 15, 2026 • 7 min read

architecture, memory, agents, mcp, hermes

The Hermes paper from Nous Research is one of the more thoughtful explorations of agent memory from the last year. Their architecture exposes a small, deliberate set of memory primitives (session briefs, procedural outcomes, end-of-session consolidation) and lets the agent compose its own context from them rather than dumping everything into the prompt window.

What Hermes does NOT do is store memories in a graph. The reference implementation uses flat files: append-only logs per memory class, with periodic offline consolidation jobs. That's the right call for a paper focused on the memory model itself; it makes the primitives easier to reason about. It's the wrong call once you want those primitives to interoperate, dedup across sessions, and answer "what does the agent know about X" in a single round-trip.

So we shipped four Hermes-inspired surfaces in memnode this week, synthesized from the existing knowledge graph rather than backed by separate flat files. The features carry their weight because the graph is already there. We're not adding new storage; we're adding new lenses over what's stored. This post walks through the four: what each one is for, what we changed from the Hermes reference, and where the graph backing matters.

1. Session-start brief

The first surface is recall_brief (HTTP POST /v1/brief, MCP recall_brief). The agent calls it once at session start. It returns a compact paste-able text block of the highest-value active memories, ranked and budget-bounded.

The ranking is a weighted blend:

score = 0.35 * salience
      + 0.30 * confidence
      + 0.20 * recency
      + 0.15 * authority

Pinned memories are surfaced first regardless of score. After ranking, we deduplicate by lowercased summary so two restatements that reduce to the same summary don't both land in the brief. The output is sized to fit a caller-supplied character budget (clamped to the range 64..MAX), so the brief never blows the agent's context window.
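The ranking, pinning, dedup, and budget steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not memnode's implementation: the Memory fields, the MAX_BUDGET value, and the "- " line format are hypothetical, and the four inputs are assumed to be normalized to 0..1.

```python
from dataclasses import dataclass

MIN_BUDGET = 64      # lower clamp named in the text
MAX_BUDGET = 8192    # hypothetical value; the post only says "MAX"


@dataclass
class Memory:
    summary: str
    salience: float
    confidence: float
    recency: float
    authority: float
    pinned: bool = False


def score(m):
    """Weighted blend from the post (inputs assumed to be in 0..1)."""
    return (0.35 * m.salience + 0.30 * m.confidence
            + 0.20 * m.recency + 0.15 * m.authority)


def build_brief(memories, budget):
    """Return a ranked, deduplicated text block that fits the budget."""
    budget = max(MIN_BUDGET, min(budget, MAX_BUDGET))
    # Pinned memories sort first; within each group, higher score wins.
    ranked = sorted(memories, key=lambda m: (not m.pinned, -score(m)))
    seen = set()
    lines = []
    used = 0
    for m in ranked:
        key = m.summary.strip().lower()
        if key in seen:
            continue  # same lowercased summary already in the brief
        seen.add(key)
        line = "- " + m.summary
        cost = len(line) + (1 if lines else 0)  # +1 for the newline separator
        if used + cost > budget:
            break  # stop at the first entry that would exceed the budget
        lines.append(line)
        used += cost
    return "\n".join(lines)
```

Stopping at the first entry that does not fit keeps the output a strict prefix of the ranking; skipping it and trying smaller entries would pack the budget more tightly at the cost of that property.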

The Hermes reference returns a session-start brief too, but constructs it by walking a fixed-window log per memory class. That works in a single-session-per-process model. It falls apart when memories are written and read across overlapping sessions, when the same fact is restated three times across a week, or when authority needs to break ties (user-said-so beats system-said-so beats agent-inferred). The graph already carries that metadata on each node, so the brief can use it without a separate index.

The thing that matters operationally is what the brief is NOT. It does not synthesize an answer. It does not summarize. It does not call an LLM. It returns a ranked, deduplicated, budget-bounded slice of the existing memory, and stops. The agent decides what to do with it. That separation keeps the brief deterministic, debuggable, and free.

2. Procedural outcome reporting

The second surface is report_procedure_outcome (HTTP POST /v1/procedures/outcome, MCP report_procedure_outcome). After the agent runs a stored procedure (say, "deploy the frontend"), it reports back whether the outcome was a success or a failure. The procedure node's success_rate is updated via an exponential moving average:

new_rate = 0.9 * old_rate + 0.1 * outcome

The usage_count ticks up either way. Over time, frequently-successful procedures float to the top of the recall ranking; chronically-failing ones sink. No human has to grade them, no separate analytics pipeline has to consume them, and the 0.1 weight on the newest outcome is large enough that a procedure that used to work but suddenly stops working falls out of favor within a few invocations rather than a few hundred: each consecutive failure multiplies the rate by 0.9, so about seven failures in a row roughly halve it.
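To make the decay speed concrete, here is a small Python sketch of the update rule. The function names are hypothetical, and the outcome is assumed to be encoded as 1.0 for success and 0.0 for failure.

```python
ALPHA = 0.1  # weight on the newest outcome, from the formula above


def update_success_rate(old_rate, success):
    """One EMA step: new_rate = 0.9 * old_rate + 0.1 * outcome."""
    outcome = 1.0 if success else 0.0
    return (1.0 - ALPHA) * old_rate + ALPHA * outcome


def failures_until_below(start_rate, threshold):
    """Count consecutive failures needed to push the rate under threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    rate = start_rate
    n = 0
    while rate >= threshold:
        rate = update_success_rate(rate, False)
        n += 1
    return n
```

Under these assumptions, a procedure sitting at a 0.95 success rate drops below 0.5 after 7 consecutive failures (failures_until_below(0.95, 0.5) returns 7), because each failure multiplies the rate by 0.9.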

Hermes treats procedural memory the same way it treats semantic memory: as an append-only log. There is no first-class outcome tracking; if you want to know whether a procedure worked, you grep the log yourself.

The change here is small in code and large in operational consequence. EMA-tracked outcomes turn procedural memory into a self-correcting subsystem. You don't have to manually deprecate a procedure that's no longer useful; the rank does it for you. And because the EMA lives on a graph node, not in a side table, every recall query sees the up-to-date success_rate without needing to join across systems.

3. End-of-session consolidation

The third surface is session_end (HTTP POST /v1/session/end). When the agent declares a session done, three things happen on the graph:

Dedup: equivalent memories get merged. The graph already has a notion of canonical-form-per-summary, so dedup at session-end is a sweep over the session's writes against the canonical index. Duplicates don't get logged-and-ignored; they get unified into one node with merged source attribution.
Episodic to semantic promotion: an episodic memory ("on Tuesday the user said they prefer dark mode") that's been observed enough times across sessions gets promoted to a semantic memory ("the user prefers dark mode"). The promotion threshold is per-class.
Decay: low-salience, low-confidence, never-recalled memories from prior sessions become eligible for archival. Decay isn't deletion: archived nodes stay in the graph for audit; they just stop scoring in the recall pipeline.
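The three steps above can be sketched as a single pass in Python. Everything here is an assumption made for illustration: the Node fields, the PROMOTE_AFTER threshold, the 0.2 decay cutoffs, and whitespace-plus-case normalization as the canonical form. Real promotion would also rewrite the summary into its timeless form, which this sketch skips.

```python
from dataclasses import dataclass, field

PROMOTE_AFTER = 3   # hypothetical; the post says the threshold is per-class
DECAY_CUTOFF = 0.2  # hypothetical salience/confidence floor


@dataclass
class Node:
    summary: str
    kind: str = "episodic"            # "episodic" or "semantic"
    sources: set = field(default_factory=set)
    observations: int = 1
    salience: float = 0.5
    confidence: float = 0.5
    recall_count: int = 0
    archived: bool = False


def consolidate(nodes):
    """Dedup, promote, and decay; mutates and returns the surviving nodes."""
    canonical = {}
    for n in nodes:
        key = " ".join(n.summary.lower().split())  # assumed canonical form
        kept = canonical.get(key)
        if kept is None:
            canonical[key] = n
            continue
        # Dedup: fold the duplicate into the kept node, merging attribution.
        kept.sources |= n.sources
        kept.observations += n.observations
        kept.recall_count += n.recall_count
    for n in canonical.values():
        # Promotion: enough observations turns episodic into semantic.
        if n.kind == "episodic" and n.observations >= PROMOTE_AFTER:
            n.kind = "semantic"
        # Decay: archive (not delete) weak, never-recalled memories.
        if (n.salience < DECAY_CUTOFF and n.confidence < DECAY_CUTOFF
                and n.recall_count == 0):
            n.archived = True
    return list(canonical.values())
```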

The session itself is a NetworkSession value persisted to <data_dir>/system/sessions.json via atomic temp-and-rename. Sessions survive process restart, which sounds basic but matters more than it should: the Hermes reference is in-memory only, so a restart drops every session's open state.
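For readers unfamiliar with the temp-and-rename pattern, a minimal Python sketch follows. The function names and the JSON layout are hypothetical; only the pattern itself (write a temp file in the same directory, then rename over the target) comes from the text.

```python
import json
import os
import tempfile


def save_sessions_atomic(path, sessions):
    """Write sessions to path so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".sessions-",
                                    suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(sessions, f)
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomically swaps in the new file
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def load_sessions(path):
    """Load persisted sessions; a missing file means no sessions yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

A crash before the rename leaves the previous sessions.json intact, which is what lets open sessions survive a restart.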

The Hermes paper describes consolidation as an offline job that runs against the corpus periodically. We run it at session boundary because that's when the agent can usefully signal "I'm done; safe to compact." The graph has all the lineage edges already (SUPERSEDES, OBSERVED_FROM, MEMBER_OF), so consolidation becomes a graph rewrite rather than a multi-pass log scan.

Closed sessions are kept on disk for telemetry but not rehydrated into the live map. If you want to know what the agent did during a session that ended yesterday, the trail is there; it just isn't taking up working memory.

4. FTS recall signal

The fourth feature isn't a new endpoint. It's a new ingredient in the existing recall pipeline. Vector embeddings are great at capturing semantic similarity. They are bad at exactly the kind of text agents need to recall: rare identifiers, flag-like strings, error codes, file paths, command-line invocations, CJK substrings the tokenizer chops up wrong.

We added a character-trigram inverted index (text_index.rs) with Jaccard-like overlap scoring, seeded into the recall candidate pool via apply_fts_seeding. The index is built lazily: the first query that needs it triggers a build, and it is rebuilt only when GraphEngine.mutation_version advances. There's no on-disk format and no migration to worry about.
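The lazy-build rule is simple enough to sketch in Python. The Graph and LazyIndex classes below are hypothetical stand-ins; only the idea of comparing a stored version against mutation_version comes from the text.

```python
class Graph:
    """Toy stand-in for a graph engine with a mutation counter."""

    def __init__(self):
        self.nodes = {}
        self.mutation_version = 0

    def put(self, node_id, text):
        self.nodes[node_id] = text
        self.mutation_version += 1  # every write advances the version


class LazyIndex:
    """Builds an index on first use and rebuilds only after mutations."""

    def __init__(self, graph, build):
        self._graph = graph
        self._build = build        # callable: graph -> index
        self._index = None
        self._built_at = None      # version the cached index reflects

    def get(self):
        version = self._graph.mutation_version
        if self._index is None or self._built_at != version:
            self._index = self._build(self._graph)
            self._built_at = version
        return self._index
```

Because the index is derived entirely from graph state, it can be thrown away at any time, which is why no on-disk format is needed.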

The result: if you stored --target=tst1.supercraft.host, the agent can recall it later by remembering the substring tst1.supercraft, not just by approximate semantic similarity to "test server". Vector search would either miss this entirely or surface it at rank 47 next to a dozen unrelated tokens.
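Here is a minimal Python sketch of a character-trigram inverted index with plain Jaccard scoring. It is not the text_index.rs implementation: the class and method names are hypothetical, and the post only says the real scoring is "Jaccard-like".

```python
from collections import defaultdict


def trigrams(text):
    """Set of lowercased character trigrams; short strings map to themselves."""
    s = text.lower()
    if len(s) < 3:
        return {s} if s else set()
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TrigramIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # trigram -> node ids
        self._grams = {}                   # node id -> its trigram set

    def add(self, node_id, text):
        grams = trigrams(text)
        self._grams[node_id] = grams
        for g in grams:
            self._postings[g].add(node_id)

    def search(self, query, limit=10):
        """Return (node_id, jaccard) pairs, best first."""
        q = trigrams(query)
        if not q:
            return []
        # Only nodes sharing at least one trigram are candidates.
        candidates = set()
        for g in q:
            candidates |= self._postings.get(g, set())
        scored = []
        for node_id in candidates:
            doc = self._grams[node_id]
            jaccard = len(q & doc) / len(q | doc)
            scored.append((node_id, jaccard))
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:limit]
```

With "--target=tst1.supercraft.host" and an unrelated note about a test server indexed, a search for "tst1.supercraft" returns only the first node: every query trigram appears in it, and none appear in the note. Plain Jaccard penalizes long documents for short queries, so a real scorer may weight toward query containment instead.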

Hermes uses vector plus keyword. We use vector plus keyword plus trigram FTS. The pattern that emerged in production: you want as many cheap orthogonal recall signals as you can afford, blended at the candidate-selection stage, not at the final-ranking stage. Adding FTS as a third orthogonal signal materially improved recall on the queries that vector and keyword both miss. It's the same observation that drove search engines from "BM25 OR semantic" to "BM25 plus semantic plus LSI" two decades ago.
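One way to read "blended at the candidate-selection stage" is that each signal nominates its own top hits and the union goes to a single final ranker. The Python sketch below illustrates that reading only; the function name, the per-signal cutoff, and the score dictionaries are assumptions, not memnode's pipeline.

```python
def seed_candidates(signals, per_signal_k):
    """Union the top-k hits of each signal into one candidate pool.

    signals maps a signal name (e.g. "vector", "keyword", "fts") to a
    dict of node_id -> score. The result maps node_id -> {signal: score},
    so a later ranking stage can see which signals nominated each node.
    """
    pool = {}
    for name, hits in signals.items():
        top = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
        for node_id, s in top[:per_signal_k]:
            pool.setdefault(node_id, {})[name] = s
    return pool
```

The point of seeding this way is that a node only one signal can find, such as an exact identifier that trigram matching catches, still reaches the final ranker instead of being averaged out earlier.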

Why graph backing matters

The pattern across all four features is the same. Hermes describes the right primitives, and a graph backend lets us implement them with less ceremony than the reference does. Pinned-first ranking is a property on the node. Dedup is a graph rewrite. EMA outcomes are a numeric property updated in place. Session-end consolidation walks lineage edges that already exist. FTS is a side index keyed by node id.

If you were going to build session briefs, EMA-graded procedures, and end-of-session consolidation against flat files, every one of those features would need its own bespoke mini-store and its own bespoke index. With a graph, it's four small features against a shared substrate. The cost-of-adding-a-feature curve flattens out.

The opposite cost (moving an existing flat-file memory subsystem onto a graph) is real and not what we'd recommend mid-project. But for a green-field memory layer, the graph backing is the lever that makes Hermes-style features cheap enough to ship in a single commit.

All four features are live in the data-plane today. The MCP tool surfaces are exposed; the HTTP endpoints are stable. If you're using memnode, recall_brief is the easiest one to integrate first: one call at session start, one paste-able text block, no other plumbing. The procedural outcome loop pays for itself faster as the agent runs more procedures. Session-end consolidation is invisible until you start measuring brief quality across weeks; then it shows up.

The four-feature shape (brief, outcomes, consolidation, FTS) is also a useful integration template if you're building your own memory layer. Even if you don't use memnode, the question Hermes poses, "what minimum primitives does an agent's memory subsystem need to expose", is worth taking seriously. The vector-DB-as-default critique covers the storage side; this set of primitives is the surface side. They are independent decisions, and both need answering.

Adjacent concerns

Persistent agent memory raises two questions that sit next to the primitives we covered above:

Security. Memory poisoning, cross-tenant leakage, and indirect prompt injection through stored memory are real attack surfaces. The 2026 LLM Security Checklist's Memory category (controls M-1 through M-6) is the auditable set: identity-scoped namespaces, source-signed entries, write quotas, user-inspectable memory. Walk it against any memnode deployment that holds untrusted writes.

Billing. If you meter memory operations per tenant (recall, write, consolidate), each is a billable event. UsageBox handles the metering pipeline; the open-source storage engine is usagedb on GitHub. The pattern is the same shape as token-metering for raw LLM calls but the per-event cost model differs.