mem0

@mem0ai

State of Memory in Agent Harness

Agent harnesses are where AI software actually runs. Cursor, Devin, Claude Code, Codex: these environments handle context, orchestrate tools, coordinate agents, and increasingly, manage memory. The harness, not the model, is increasingly the product that ships software.

Memory is where harness design gets hard.

Where does it sit? What persists when a session ends? It is a largely unsolved problem, and every major harness is solving it differently.

This post covers what each shipped, where each falls short, and what that gap says about what memory infrastructure has to do.

What Agent Memory Actually Is

Three different things get called memory, and the distinction matters because each has different failure modes.

Working memory is what lives in the context window during a session. It resets on session end; the compaction problem (what survives when the window fills) belongs here.

External memory is anything persisted outside the weights: vector stores, knowledge graphs, files. It survives sessions; the weights don't change. This is where essentially all production memory lives in 2026.

Parametric memory is knowledge encoded into weights via gradient descent, shaped by the training loop the harness feeds. It generalizes by applying rules rather than retrieving examples. Zero production deployments in 2026.

(The cognitive-science split, semantic / episodic / procedural, describes what kind of information is stored; the three tiers above describe where it lives.)

The paper "Contextual Agentic Memory is a Memo, Not True Memory" (arXiv:2604.27707) formalizes the ceiling: retrieval needs Ω(k²) stored examples to match what parametric memory does with O(d) weight updates. Every system below operates within it.

What Major Harnesses Shipped [Overview]

1.@AnthropicAI: Claude Code

Two tracks. CLAUDE.md is human-authored config (conventions, instructions), read at session start. Auto-memory is Claude-written notes from a background extraction agent, stored under ~/.claude/projects/<repo>/memory/ around a MEMORY.md index capped at 200 lines / 25KB, in four categories: user, feedback, project, reference.

Retrieval shapes the limits: each turn Claude Code makes a separate call to a smaller model with a manifest of filenames and descriptions, and the model picks which files load. No embeddings, max five files per turn, and silent truncation past the cap (a dropped file gets no warning because it never loads).

Shortcoming. Selection is by filename, not semantic search, so a relevantly-named file wins over a relevant one. Team sharing exists behind a TEAMMEM flag, but underneath it's local, repo-scoped markdown with no semantic index.

Read more here:

7.@awscloud Bedrock AgentCore

AgentCore is AWS's hosted agent platform: its Runtime is the harness layer (AWS's analog to Managed Agents) and Memory is its managed service. Memory runs three async extraction strategies (semantic facts, preferences, narrative summary), ~20–40s to extract and ~200ms to retrieve; changed facts are marked INVALID rather than deleted, preserving lineage. Published: LoCoMo 70.58, PrefEval 79, PolyBench-QA 83.02.

Shortcoming. It's AWS-specific (ecosystem lock-in), and its published LoCoMo sits well below leading memory systems.

8.@windsurf

Windsurf's memory is generated and managed by its engine, Cascade, with no developer workflow: local, workspace-scoped files at ~/.codeium/windsurf/memories/ capturing codebase patterns and conventions.

Shortcoming. What's captured is Cascade's call, not the developer's; memory is workspace-scoped (invisible across projects) and local (no cross-device or team sharing).

9.Cognition Devin

Devin splits memory in two. Knowledge is human-curated trigger-content facts (no auto-capture); DeepWiki is reference docs (30 pages, 100 notes, 10,000 chars each). Devin suggests Knowledge after sessions, but a human approves before anything is stored.

Shortcoming. The approval gate keeps quality high but is friction: teams that don't review accumulate nothing. Limits are modest, and Knowledge is curated for Devin, so it doesn't transfer to other tools.

Memory Benchmarks Are the Weak Link

The benchmarks the field uses to measure memory are mostly bad. They test recall of facts from past conversations, they're near-saturated, and a high score doesn't predict better decisions.

LoCoMo is the worst of the common ones. Ten conversations make comparisons unreliable, many questions don't require memory (a trivial grep baseline scores ~74%), and adversarial questions share surface similarity with their targets, so models win by pattern-matching. LongMemEval is still okay: 500 curated questions across five abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, abstention), scaling toward 1.5M tokens; still recall-centric but a real test.

The deeper problem is what none of them measure. MemoryArena (arXiv:2602.16313) tests memory that must guide action, and systems that near-saturate LoCoMo and LongMemEval fail at it; Anatomy of Agentic Memory (arXiv:2602.19320) makes the critique formal (near-saturated, measuring similarity not task utility).

And none test at production scale: standard benchmarks cap near 1.5M tokens while production agents hit 10M+. BEAM (ICLR 2026) is the only one built for that range. The honest conclusion: the field needs a new memory benchmark, and leaderboard scores should be read skeptically, including the ones below.

What Research Shows Is Still Broken

The stability-plasticity dilemma relocated. Moving to external memory didn't end catastrophic forgetting. "When Continual Learning Moves to Memory" (arXiv:2604.27003) shows new and old memories compete for retrieval slots exactly as they once competed for weights; raw trajectories from easy tasks hurt harder ones (forward transfer −9.5%).

Selective forgetting is unsolved. MemoryAgentBench (arXiv:2507.05257) names four competencies; systems handle retrieval but not selective forgetting (unlearning a stale fact while keeping the structure around it).

Memory is an attack surface. "No Attacker Needed" (arXiv:2604.01350) measured 57–71% cross-user contamination under normal usage; poisoning attacks succeed at 6–38% (arXiv:2601.05504).

The Pattern in the Shortcomings

The same gaps recur. Storage is bounded and local (Claude Code 25KB, Hermes 2,200 chars, Codex's 5,000-token load). Retrieval is mostly keyword (Claude Code by filename, Codex by grep, Hermes by FTS5); the two that search semantically are local-and-compaction-limited (OpenClaw) or cloud-locked (AgentCore). Memory is harness-scoped, so Claude Code's memory means nothing to Codex. Staleness handling barely exists (Copilot aside). And isolation is an afterthought, hence the contamination numbers. These are the limits of the harness boundary.

Mem0

Mem0 is built for the case where the harness boundary isn't the end of the problem. Its architecture is a hybrid: a vector store for semantic retrieval, a knowledge graph for relational reasoning, and key-value for fast metadata.

The v3 algorithm (April 2026) moved to single-pass ADD-only extraction, multi-signal retrieval (semantic + BM25 + entity linking in one pass), and entity linking inside the vector store, dropping v2's external graph DB.

It does this at ~6,900 tokens and 1.44s per query, against ~26,000 tokens and 17.12s for full-context retrieval.

Against the gaps: an external store with no cap; multi-signal retrieval that finds last month's auth-endpoint discussion even when the phrasing differs; identity-scoped memory that one user's namespace can't leak across, targeting the 57–71% contamination rate. And it isn't theoretical: Mem0 ships for every harness above, a Claude Code plugin, a Codex MCP server, first-class Hermes and OpenClaw providers, native AWS Strands integration, across 21 frameworks and 20 vector stores. Memory becomes infrastructure, not a per-harness feature.

Where Things Stand

Memory is now infrastructure: every major harness shipped it, because an agent that's capable within a session but amnesiac across them is fundamentally limited.

The harness-native implementations made real progress, but they break at the same boundary, bounded local storage, keyword retrieval, harness scoping, weak staleness handling, and isolation gaps.

The benchmarks meant to measure all this are themselves weak, and the one that tests production scale (BEAM) is the one most systems don't report.

Overall, these are the gaps Mem0 is working hard to fill: memory that is portable, semantically searchable, cross-agent, and scaled to the token volumes production agents accumulate.

In Context #11

This blog is part of In Context, a@mem0ai blog series covering AI Agent memory and context engineering.

Mem0 is an intelligent, open-source memory layer designed for LLMs and AI agents to provide long-term, personalized, and context-aware interactions across sessions.

Get your free API Key here:app.mem0.ai

or self-host mem0 from our open source github repository

References

Contextual Agentic Memory is a Memo, Not True Memory (April 2026)

MemoryArena (February 2026, Stanford/UCSD/Princeton)

Anatomy of Agentic Memory (February 2026)

When Continual Learning Moves to Memory (April 2026)

MemoryAgentBench, ICLR 2026

No Attacker Needed: Cross-User Contamination (April 2026)

Memory Poisoning Attack and Defense (January 2025)

Anthropic Engineering: Scaling Managed Agents (April 2026)

OpenAI: Codex memories documentation