I Spent 5 Days Debugging My OpenClaw Agent's Memory. Here's Everything I Learned.

My agent's name is Chiti. It runs on Telegram, handles customer support for two SaaS products, drafts tweets, manages invoices, and coordinates with my co-founder across timezones. It's the closest thing I have to a junior employee.
And for weeks, it kept forgetting things.
Not in a subtle way. I'd spend an hour configuring a daily cron job, switch models, and the next session Chiti would act like we'd never spoken. I'd reference a decision from two days ago and get a blank stare. I'd ask it to continue a task and it would start from scratch.
So I stopped building features and spent 5 days, whenever I could find the time, just fixing memory. This is everything I found, everything I broke, and everything that actually worked.
Day 1: The Agent Forgets Everything After Long Conversations
The first problem was simple to describe and painful to diagnose.
After long conversations, Chiti would start losing earlier context. Not gradually; it would just vanish. Things I told it 20 messages ago were gone. Decisions we made at the start of the session? Never happened.
The culprit was compaction. When the conversation fills up the context window, OpenClaw compresses older messages into a summary to make room for new ones. The summary captures the gist but drops specifics. Names, numbers, exact decisions - gone.
This is by design. The context window is finite. But the default behavior treats everything equally, which means your carefully crafted instruction from message #3 gets the same treatment as casual small talk from message #7.
What I did:
I enabled memory flush before compaction. This tells the agent to write important context to disk before the compressor runs.
When the session approaches the context limit, OpenClaw triggers a silent turn that reminds the agent to save durable facts to memory/YYYY-MM-DD.md before compaction wipes them. The agent writes what matters, compaction runs, and the important stuff survives on disk even if the context summary loses it.
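To make the ordering concrete, here is a minimal Python sketch of "flush first, compact second". It is not OpenClaw's implementation: the function name, the 4-characters-per-token estimate, and the stub summary are all assumptions for illustration. The point is only that facts reach memory/YYYY-MM-DD.md before anything is dropped from the context.

```python
from datetime import date
from pathlib import Path


def flush_then_compact(messages, durable_facts, memory_dir, limit_tokens, today=None):
    """Write durable facts to the daily log, then compact if near the limit.

    `messages` and `durable_facts` are lists of strings. Token usage is
    estimated as characters // 4, a rough assumption for illustration only.
    """
    used = sum(len(m) for m in messages) // 4
    if used < limit_tokens:
        return messages  # still under the limit, nothing to do

    # Step 1: persist facts to memory/YYYY-MM-DD.md BEFORE touching the context.
    today = today or date.today()
    log_path = Path(memory_dir) / f"{today.isoformat()}.md"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        for fact in durable_facts:
            fh.write(f"- {fact}\n")

    # Step 2: "compact" by replacing all but the last two messages with a stub.
    summary = f"[summary of {len(messages) - 2} earlier messages]"
    return [summary] + messages[-2:]
```

Whatever the summary loses in step 2, the bullet lines written in step 1 are still on disk for the next session to read.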
What I learned:
Compaction is not your enemy. Losing information during compaction is. The fix is making sure anything worth remembering gets written to a file before the compressor touches it. If it's only in the context window, it's temporary. If it's on disk, it survives.
Day 2: Search Returns Garbage
With daily logs accumulating and MEMORY.md growing, I needed the agent to actually find things. The built-in memory search was returning irrelevant results or missing obvious matches.
The issue was the search backend. OpenClaw's default SQLite-based search uses vector embeddings (semantic similarity) to find relevant chunks. It works for broad queries but struggles with exact matches. I'd search for a specific client name and get results about a completely different topic that happened to use similar language.
What I did:
I switched to QMD as the memory search backend. QMD combines BM25 (keyword matching) with vector embeddings and a reranker. So when I search for "Charles payment failure", it finds results that contain those exact words AND results that are semantically related, then reranks them by relevance.
I also configured the QMD paths to include my learnings folder:
What I learned:
Pure semantic search sounds good in theory but fails on proper nouns, specific numbers, and exact phrases. Hybrid search (keywords + vectors + reranking) is significantly better for real-world agent memory. If your agent can't find something you know is in its files, the search backend is probably the bottleneck, not the files themselves.
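A toy Python sketch of why the hybrid approach helps. This is not QMD's pipeline: the keyword ranker is a crude stand-in for BM25, the "semantic" ranking is hard-coded as a hypothetical vector-search result, and the fusion step is reciprocal rank fusion, a technique I'm swapping in because it is easy to show in a few lines. The documents are made up.

```python
def keyword_rank(query, docs):
    """Rank doc ids by how many query terms appear verbatim (a stand-in for BM25)."""
    terms = set(query.lower().split())
    scores = {doc_id: len(terms & set(text.lower().split())) for doc_id, text in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids: each list contributes 1 / (k + rank) per id."""
    fused = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(fused, key=lambda d: (-fused[d], d))


docs = {
    "a": "Charles payment failure on invoice 1042",
    "b": "billing errors and declined cards in general",
    "c": "payment reminder template for late invoices",
}
# Hypothetical vector-search output: it prefers the generic billing note.
semantic_ranking = ["b", "a", "c"]
keyword_ranking = keyword_rank("Charles payment failure", docs)  # ["a", "c", "b"]
hybrid = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(hybrid)  # ['a', 'b', 'c']
```

On its own, the semantic ranking puts the generic billing note first. Once the exact-word matches get a vote, the note that actually mentions Charles moves to the top.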
Day 3: The Agent Finds It But Doesn't Use It
This was the most frustrating day. I had confirmed that search was working: I could query manually and get the right results. But during actual conversations, Chiti would not retrieve relevant context even when it clearly existed in memory.
The problem was that retrieval is not automatic. The agent has to decide to search. And if the conversation doesn't trigger the right cues, it won't look things up.
What I did:
I added explicit retrieval instructions to the boot sequence. Instead of hoping the agent would search when needed, I told it when to search:
markdown
Before starting any task:
- Search daily logs for related context
- Check LEARNINGS.md for rules about this type of task
- If a client is mentioned, search for their history
I also built a retrieval test. I'd plant a specific marker in the daily log, something like "MARKER: 2026-02-20 - Remember to always check git status before claiming code is pushed." Then I'd wait, start a new session, and ask: "What was the marker from yesterday?" If the agent found it, retrieval was working. If not, something was broken.
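The marker test can be sketched in Python roughly like this. The helper names and the `memory/<day>.md` layout are assumptions for illustration; in practice the agent's reply comes from the chat, not from code. The sketch separates the two checks: is the marker on disk, and did the agent repeat it.

```python
from pathlib import Path


def plant_marker(memory_dir, day, text):
    """Append a MARKER line to <memory_dir>/<day>.md and return the line written."""
    line = f"MARKER: {day} - {text}"
    path = Path(memory_dir) / f"{day}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line


def find_markers(memory_dir, day):
    """Return every MARKER line stored for the given day (empty list if none)."""
    path = Path(memory_dir) / f"{day}.md"
    if not path.exists():
        return []
    return [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.startswith("MARKER:")]


def marker_recalled(agent_reply, marker_line):
    """Pass only if the reply repeats the marker's payload text (case-insensitive)."""
    payload = marker_line.split(" - ", 1)[1]
    return payload.lower() in agent_reply.lower()
```

If `find_markers` returns the line but `marker_recalled` fails on the agent's answer, storage is fine and the problem is on the retrieval side.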
What I learned:
There's a difference between "the information exists" and "the agent uses the information." You need both. Search infrastructure handles the first part. Boot instructions and retrieval habits handle the second. Test both separately.
Day 4: Making It Compaction-Safe
By now I had memory flush, hybrid search, and retrieval instructions. But I kept losing context in a specific scenario: very long sessions where compaction ran multiple times.
The problem was that memory flush only triggers once per compaction cycle. If the session was long enough for two or three compactions, only the first one got the flush treatment. Everything after that was at risk.
What I did:
I configured context pruning to work alongside compaction:
This aggressively prunes old context after 6 hours while keeping the last 3 assistant responses. Combined with memory flush, this means the agent writes important stuff to disk early, and old context gets cleaned up before it causes overflow.
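As a rough mental model of that behavior (my reading of it, not OpenClaw's actual pruning code), here is a small Python sketch. The `Message` type and the `prune` function are hypothetical; the 6-hour TTL and "keep last 3 assistant responses" come from the settings described above.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str         # "user", "assistant", or "tool"
    age_hours: float  # how long ago the message was added
    text: str


def prune(messages, ttl_hours=6.0, keep_last_assistants=3):
    """Drop messages older than the TTL, but always keep the newest N assistant replies."""
    assistant_idx = [i for i, m in enumerate(messages) if m.role == "assistant"]
    # Guard against N == 0, because a [-0:] slice would select every index.
    protected = set(assistant_idx[-keep_last_assistants:]) if keep_last_assistants > 0 else set()
    return [m for i, m in enumerate(messages) if i in protected or m.age_hours <= ttl_hours]
```

Anything outside both the TTL and the protected set is gone after pruning, which is why the flush to disk has to happen first.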
I also added a MARKER test protocol: after any significant configuration change, I plant a marker in the daily log and test retrieval across compaction boundaries. If the marker survives, the change worked. If not, something broke.
What I learned:
Long sessions are where memory systems actually get tested. Short conversations rarely hit compaction. It's the 2-hour deep work sessions where you lose context and can't figure out why. Test your memory system under load, not just in quick chats.
Day 5: The System Prompt Was 28% Bloated
This was the day everything clicked. I ran /context detail and stared at the numbers.
My agent was loading 11,887 tokens of system prompt before it even read my message. 51 skills, 20 of which I'd never used. MEMORY.md was 200 lines of company wiki loaded on every single session. And I had two competing boot sequences - one in BOOT.md (which OpenClaw doesn't even recognize) and one buried 200 lines deep in AGENTS.md.
Worst of all, every time I switched models, Chiti forgot everything. No handover protocol. No write-back of current context. Just gone.
The root cause:
OpenClaw auto-reads these files on every new session: AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, MEMORY.md.
Everything else (LEARNINGS.md, daily logs, docs, reference files) the agent has to read itself using tools. If the instruction to read those files isn't in one of the auto-loaded files (specifically AGENTS.md), the agent will never see them.
My BOOT.md had the entire boot sequence. But OpenClaw doesn't auto-load BOOT.md. So the instructions just sat there, unread, doing nothing.
What I did:
I did a full audit and cleanup:
The boot sequence now looks like this:
markdown
Before doing ANYTHING:
1. Read USER.md
2. Read learnings/LEARNINGS.md
3. Read memory/YYYY-MM-DD.md (today + yesterday)
4. Read MEMORY.md (main session only, never in groups)
5. Read PROTOCOL_COST_EFFICIENCY.md
6. Print: LOADED: USER | LEARNINGS | DAILY | MEMORY | PROTOCOL
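To sanity-check a boot sequence like this outside the agent, a small Python helper can resolve the same file list for a given date and report what is missing. This is a sketch under assumptions: the function names are hypothetical, and the paths simply mirror the steps above.

```python
from datetime import date, timedelta
from pathlib import Path


def boot_files(workspace, today, main_session=True):
    """Return the files named in the boot sequence, in read order."""
    ws = Path(workspace)
    yesterday = today - timedelta(days=1)
    files = [
        ws / "USER.md",
        ws / "learnings" / "LEARNINGS.md",
        ws / "memory" / f"{today.isoformat()}.md",
        ws / "memory" / f"{yesterday.isoformat()}.md",
    ]
    if main_session:  # MEMORY.md is skipped in group chats
        files.append(ws / "MEMORY.md")
    files.append(ws / "PROTOCOL_COST_EFFICIENCY.md")
    return files


def missing_boot_files(workspace, today, main_session=True):
    """List boot files that do not exist, so a silent gap shows up early."""
    return [p for p in boot_files(workspace, today, main_session) if not p.exists()]
```

Running `missing_boot_files("workspace", date.today())` before a session is a cheap way to catch a renamed or never-created file that the agent would otherwise skip without complaint.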
The write discipline:
markdown
After every task:
1. Log decision + outcome → memory/YYYY-MM-DD.md
2. If mistake → append to learnings/LEARNINGS.md
3. If significant context → update MEMORY.md (only during heartbeat reviews, never directly during tasks)
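The same discipline, expressed as a minimal Python sketch. The agent does this through its own file tools, so `log_task` and its line format are purely illustrative assumptions. What matters is the shape: every task appends to the daily log, mistakes also append a rule to LEARNINGS.md, and MEMORY.md is never touched here.

```python
from pathlib import Path


def log_task(workspace, day, decision, outcome, mistake_rule=None):
    """Append a task record to the daily log; add a rule to LEARNINGS.md on mistakes.

    MEMORY.md is deliberately not written here; it is curated separately.
    """
    ws = Path(workspace)
    daily = ws / "memory" / f"{day}.md"
    daily.parent.mkdir(parents=True, exist_ok=True)
    with daily.open("a", encoding="utf-8") as fh:
        fh.write(f"- DECISION: {decision} | OUTCOME: {outcome}\n")

    if mistake_rule:
        learnings = ws / "learnings" / "LEARNINGS.md"
        learnings.parent.mkdir(parents=True, exist_ok=True)
        with learnings.open("a", encoding="utf-8") as fh:
            fh.write(f"- {mistake_rule}\n")
```

Both files are opened in append mode, which keeps the daily log append-only even if the function is called many times in one session.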
The handover protocol:
markdown
Before session end or model switch:
Write HANDOVER section to memory/YYYY-MM-DD.md:
- What was discussed
- What was decided
- Pending tasks with exact details
- Next steps remaining
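For illustration, here is how a HANDOVER section could be rendered and later picked up again, as a Python sketch. The heading names and both helper functions are assumptions, not an OpenClaw format. The second function covers the other half of the protocol: the new session finding the most recent handover in the daily log.

```python
def render_handover(discussed, decided, pending, next_steps):
    """Build a HANDOVER section (as markdown text) from four lists of strings."""
    sections = [
        ("What was discussed", discussed),
        ("What was decided", decided),
        ("Pending tasks", pending),
        ("Next steps", next_steps),
    ]
    lines = ["## HANDOVER"]
    for title, items in sections:
        lines.append(f"### {title}")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines) + "\n"


def latest_handover(log_text):
    """Return the text from the last HANDOVER heading to the end of the log, or None."""
    start = log_text.rfind("## HANDOVER")
    return log_text[start:] if start != -1 else None
```

Pending tasks are the entries worth writing with exact details (names, amounts, file paths), since the next model has nothing else to go on.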
Results:
What I learned:
The real fix wasn't adding more files. It was removing the ones that weren't doing anything. Every token in the system prompt is overhead the agent carries on every single message. Unused skills, bloated memory files, files the system doesn't even read - they all add up silently.
The Rules I Wish I Knew On Day 1
After 5 days of breaking things and fixing them, these are the rules I'd give anyone setting up OpenClaw memory:
1. Only these files auto-load: AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, MEMORY.md.
Everything else needs an explicit read instruction in AGENTS.md. If it's not in the boot sequence, the agent won't see it. BOOT.md is not a real thing in OpenClaw. I had one for weeks. It did nothing.
2. Boot sequence goes at the top of AGENTS.md.
Not in the middle. Not at the bottom. The very top. Auto-loaded files get injected into the system prompt, so the boot instructions need to be the first thing the agent processes.
3. Write discipline matters more than read discipline.
Most people set up files for the agent to read but never enforce writing back. If the agent doesn't log decisions, outcomes, and mistakes to disk, those things only exist in the context window. And the context window gets compacted. Write-back is how temporary context becomes permanent memory.
4. Never write directly to MEMORY.md during tasks.
Daily logs are raw and append-only. MEMORY.md is curated long-term memory. If you let the agent dump anything into MEMORY.md, it bloats into a 200-line mess within weeks. Curate MEMORY.md during periodic reviews (heartbeat or cron) by distilling insights from recent daily logs. I learned this from a fellow OpenClaw user who caught his agent doing exactly this: bloating MEMORY.md with uncurated noise until it was useless.
5. LEARNINGS.md is the most underrated file.
Every mistake the agent makes should become a one-line rule. "Never claim code is pushed without checking git status." "Don't read full MEMORY.md in group chats." "Always confirm the user's timezone before scheduling." These rules compound. After a few weeks, your agent has a personal operations manual built from its own failures.
6. Test retrieval, not just storage.
Storing information and retrieving it are different problems. I've had files indexed and searchable but never accessed because the agent didn't know to look for them. Plant markers, test across sessions, test across model switches. If the agent can't find what you stored yesterday, the storage doesn't matter.
7. The handover protocol is the model-switch fix.
OpenClaw agents lose all context when you switch models. The new model starts with a fresh context window and only sees the auto-loaded files. Without a handover protocol that dumps current state to the daily log before the switch, the new model has no idea what was happening. This was my single biggest pain point for weeks.
8. Run /context detail regularly.
This command shows exactly what's eating your tokens. Skills you forgot you installed, files that grew without you noticing, tools you never use. I found 20 unused skills burning 3,000 tokens per session. That's 3,000 tokens of overhead on every single message, for features I'd never touched.
9. Hybrid search beats pure semantic search.
BM25 (keywords) + vectors (meaning) + reranking gives significantly better results than vectors alone. Client names, specific numbers, exact phrases: semantic search misses these. Keyword search catches them. Use both.
10. Compaction is not the enemy. Unwritten context is.
I spent days fighting compaction before realizing the fix was simpler: make sure anything important gets written to a file before compaction runs. Memory flush handles this automatically. If it's on disk, it survives compaction. If it's only in the conversation, it's at risk.
My Current Setup
For reference, here's what my workspace looks like now:
workspace/
├── AGENTS.md (boot sequence + write discipline + handover protocol)
├── SOUL.md (personality and behavior)
├── IDENTITY.md (name, role)
├── USER.md (owner info)
├── TOOLS.md (tool usage guidelines)
├── HEARTBEAT.md (autonomous check-in behavior)
├── MEMORY.md (curated long-term memory, ~90 lines)
├── PROTOCOL_COST_EFFICIENCY.md
├── learnings/
│   └── LEARNINGS.md (rules from mistakes)
├── memory/ (daily logs: YYYY-MM-DD.md)
├── docs/ (reference docs moved out of MEMORY.md)
│   ├── tweetsmash-arch.md
│   ├── knowledge-transfer.md
│   ├── infrastructure.md
│   └── group-chat-rules.md
└── skills/ (32 skills, down from 51)
System prompt: 8,529 tokens. Session tokens: 14,627 out of 200,000 context window (7.3%). The agent boots, reads what it needs, writes what it learns, and hands off context before model switches.
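For anyone checking the math, the percentages follow directly from the numbers in this post (11,887 tokens before, 8,529 after, 14,627 session tokens in a 200,000-token window). A quick Python check:

```python
before_tokens = 11_887    # system prompt before the cleanup
after_tokens = 8_529      # system prompt after the cleanup
session_tokens = 14_627   # tokens in use at session start
context_window = 200_000

saved = before_tokens - after_tokens
reduction_pct = round(100 * saved / before_tokens, 1)
window_used_pct = round(100 * session_tokens / context_window, 1)
print(saved, reduction_pct, window_used_pct)  # 3358 28.2 7.3
```

That is 3,358 tokens removed from the system prompt, or roughly the 28% in the Day 5 heading.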
It took 5 days to get here. Most of it was unlearning the assumption that more files equals better memory. It doesn't. Discipline does. The experiment is still ongoing.
I'm building TweetSmash and LinkedMash, social media bookmark tools, with my co-founder. I share what I learn about running OpenClaw agents in production on X: @code_rams

