
Building a Memory System for AI Agents


Every serious agent project eventually hits the same wall. You have a capable model. You have tools. You have a context window that gets populated on each turn. And then the agent decides something in turn fourteen that directly contradicts the sensible thing it did in turn three, because turn fourteen cannot see turn three unless you explicitly carry it forward. The context window is not memory. It is a whiteboard that gets erased.

The agent’s experience right now, every time you open a new chat
SESSION 1
Learns your background. Understands your stack. Builds context. Gives you a great answer.
BETWEEN SESSIONS
Everything gone. Whiteboard erased. Agent has no idea you exist.
SESSION 2
Hi, I’m an AI assistant. How can I help you today?

You have had this conversation before. You will have it again tomorrow.

The standard answer is a vector database. Embed your history, store it, retrieve the most similar chunks at the start of each turn. This is the solution that most teams reach for, and I understand why. It is fast to implement, there are good hosted options, and it solves the literal retrieval problem.

What it does not solve is the structural memory problem. A vector store tells you what is semantically similar. It does not tell you what caused what, what was decided and why, what the agent learned that supersedes something it believed earlier, or how a series of events connect into a coherent episode. You can retrieve the fact that the agent mentioned a rate limit. You cannot easily query whether that rate limit caused a decision to switch APIs, or whether that decision was later corrected when the agent discovered the rate limit only applied to unauthenticated requests.

I built AgenticMemory to hold that structure.

The first design did not hold its own weight

My first implementation was a SQLite database with a normalized relational schema. A nodes table for cognitive events, an edges table for relationships. The schema was clean in the first week.

The SQLite era, week by week
WEEK 1
Clean schema. nodes table, edges table. Feels good.
WEEK 2
Added Contradicts edge type. Also changed confidence from INT to FLOAT. Migration script #1.
WEEK 3
Added PartOf and TemporalNext. Schema drift beginning. Migration scripts #2 and #3.
WEEK 4
Writing SQL to traverse CausedBy chains. The query is 14 lines of recursive CTEs. This is wrong.

I was building a graph model on a relational substrate.

The real problem was not the migrations. It was that SQL is clumsy at graph traversal. “Starting from this decision, traverse CausedBy edges backward to the facts that grounded it, filtering for nodes that have not been superseded” — queries like that grew increasingly baroque as the edge type vocabulary expanded. Eventually I accepted that and started over.

The .amem format

The new format is binary. The file starts with a 4-byte magic sequence (AMEM), followed by a fixed 64-byte header. Then come node records, edge records, a compressed content block, and a feature matrix.

.amem file anatomy
AMEM magic (4b)
HEADER (64b)
NODE RECORDS (72b each)
EDGE RECORDS (32b each)
CONTENT BLOCK [LZ4]
FEATURE MATRIX (128-dim f32)
72-byte node record layout:
node_id (8b)
event_type (1b)
confidence f32 (4b)
decay_λ f32 (4b)
access_count (4b)
created_at (8b)
last_access (8b)
content_offset (8b)
feature_offset (8b)
reserved…
Fixed width = seek to node N is O(1): header_size + N × 72, read 72 bytes. Done.

The tradeoff: schema is compiled in. Changing a field means a migration tool, not ALTER TABLE. What SQL would not give me is the O(1) seek behavior or the contiguous feature matrix layout. I made the tradeoff deliberately.
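
To make that O(1) seek concrete, here is a minimal reader sketch in Python. The struct layout mirrors the field sizes listed above; the little-endian byte order, the 19 reserved padding bytes, and the assumption that the 64-byte header sits directly after the 4-byte magic are mine, not guarantees of the format.

```python
import struct

# Mirrors the 72-byte node record layout above.
# "<" = little-endian, no alignment padding (an assumption).
NODE_RECORD = struct.Struct(
    "<Q"   # node_id        (8b)
    "B"    # event_type     (1b)
    "f"    # confidence f32 (4b)
    "f"    # decay_lambda   (4b)
    "I"    # access_count   (4b)
    "q"    # created_at     (8b)
    "q"    # last_access    (8b)
    "Q"    # content_offset (8b)
    "Q"    # feature_offset (8b)
    "19x"  # reserved, pads the record out to 72 bytes
)
assert NODE_RECORD.size == 72

HEADER_SIZE = 4 + 64  # AMEM magic + fixed header (assumed contiguous)

def read_node(f, n: int) -> tuple:
    """Seek to node N in O(1): header_size + N * 72, then read 72 bytes."""
    f.seek(HEADER_SIZE + n * NODE_RECORD.size)
    return NODE_RECORD.unpack(f.read(NODE_RECORD.size))
```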

The EventTypes and why Correction is the interesting one

There are 6 EventTypes: Fact, Decision, Inference, Correction, Skill, and Episode.
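
The event_type field in the node record is a single byte, so the six types map onto small integers. A sketch; the numeric values are illustrative, not the format's actual encoding:

```python
from enum import IntEnum

class EventType(IntEnum):
    # Fits the 1-byte event_type field; values are illustrative.
    FACT = 1
    DECISION = 2
    INFERENCE = 3
    CORRECTION = 4
    SKILL = 5
    EPISODE = 6
```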

Correction is the one that shaped the whole design. When an agent learns that something it believed was wrong, the naive response is to delete the old node and replace it with the correct one. I rejected that early.

How corrections work — the append-only graph vs. naive deletion
NAIVE DELETION
Fact: rate limit is 100 req/min [deleted]
Fact: rate limit is 1000 req/min [replaces it]
Why did the agent switch APIs in March? No record. History erased.
SUPERSEDES CHAIN
Fact: rate limit is 100 req/min [confidence: 0.0]
↑ Supersedes
Correction: rate limit is 1000 req/min [confidence: 0.9]
Original fact preserved. Correction chain queryable. Belief state at any point in time reconstructible.

Deletion loses the history of what the agent believed and when it believed it. If an agent made a series of bad decisions grounded in a Fact that was later found to be wrong, you need the original Fact to still exist to reconstruct the belief state at the time of each decision.

The memory store is append-only in practice. Nothing gets deleted. The history is preserved. You pay in storage size, but cognitive event records are small and the LZ4 compression keeps the total footprint reasonable.
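
Here is a toy in-memory sketch of the correction mechanic (not the library's API, just the shape of it): append the Correction, link it with a Supersedes edge, zero the old Fact's confidence, delete nothing.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    node_id: int
    event_type: str
    content: str
    confidence: float

@dataclass
class MemoryGraph:
    """Toy stand-in for the append-only store."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src, dst, edge_type)

    def append_node(self, event_type: str, content: str, confidence: float) -> int:
        node = MemoryNode(len(self.nodes), event_type, content, confidence)
        self.nodes.append(node)
        return node.node_id

    def record_correction(self, old_fact_id: int, corrected: str) -> int:
        # Never delete: append, link, downgrade.
        new_id = self.append_node("Correction", corrected, 0.9)
        self.edges.append((new_id, old_fact_id, "Supersedes"))
        self.nodes[old_fact_id].confidence = 0.0  # preserved, not erased
        return new_id

g = MemoryGraph()
fact = g.append_node("Fact", "rate limit is 100 req/min", 0.9)
g.record_correction(fact, "rate limit is 1000 req/min")
```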

The seven EdgeTypes and what they make possible

CausedBy, Supports, Contradicts, Supersedes, RelatedTo, PartOf, and TemporalNext.

What each EdgeType lets you query
CausedBy
Trace backward from a bad decision to the facts that grounded it
Contradicts
Surface conflicting beliefs the agent is holding simultaneously
Supersedes
Follow the correction chain — what the agent believed before and when it changed
TemporalNext
Reconstruct the exact sequence of events within a session
PartOf
Link summary Episodes back to the events they compressed
1-byte edge type field. 256 possible values. Using 7. Room to grow.

Without typed edges, you have a collection of nodes with unlabeled connections — traversal is meaningless. With them, you can ask structural questions about causality, contradiction, belief history, and time. These are the questions that matter when an agent is debugging its own reasoning.
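
With typed edges, the query that took 14 lines of recursive CTEs becomes a short walk. A sketch against the toy graph above, assuming CausedBy edges are stored as (effect, cause) pairs:

```python
def grounding_facts(graph: MemoryGraph, decision_id: int) -> list:
    """Trace backward from a decision to the Facts that grounded it,
    skipping anything a later node supersedes."""
    superseded = {dst for _, dst, etype in graph.edges if etype == "Supersedes"}
    found, frontier, seen = [], [decision_id], set()
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        for src, dst, etype in graph.edges:
            if src == current and etype == "CausedBy":
                node = graph.nodes[dst]
                if node.event_type == "Fact" and dst not in superseded:
                    found.append(node)
                frontier.append(dst)  # keep walking the causal chain
    return found
```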

The decay formula and its known failure mode

Every node has a decay lambda. Confidence at retrieval time: base × exp(-λ × days_since_access) × log2(access_count + 1) / 10.

How decay works across different EventTypes over time
Skill: decays slowly — procedural knowledge is durable
Fact: moderate decay — observed facts stay relevant longer
Decision: moderate decay — context-dependent, may become stale
Inference: decays fast — derived conclusions go stale quickly
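
The formula and the per-type rates, directly in code. The λ values below are illustrative; the post only gives the qualitative ordering (Skill slow, Fact and Decision moderate, Inference fast), not the actual constants.

```python
import math

# Illustrative per-day decay rates; only the ordering matches the text.
DECAY_LAMBDA = {"Skill": 0.002, "Fact": 0.01, "Decision": 0.01, "Inference": 0.05}

def effective_confidence(base: float, lam: float, days_since_access: float,
                         access_count: int) -> float:
    """base * exp(-lambda * days_since_access) * log2(access_count + 1) / 10"""
    return (base
            * math.exp(-lam * days_since_access)
            * math.log2(access_count + 1) / 10)

# A Fact rehearsed often vs. one untouched for six months:
effective_confidence(0.9, DECAY_LAMBDA["Fact"], 2, 40)    # ~0.47
effective_confidence(0.9, DECAY_LAMBDA["Fact"], 180, 3)   # ~0.03
```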

Known failure mode: a genuinely important node that happens not to be retrieved for an extended period will decay to near-zero and may not surface when the relevant situation reappears. Pinning mechanism not yet built.

The access multiplier models rehearsal and retention — things accessed frequently decay more slowly. This is a heuristic borrowed loosely from spaced repetition research. The failure mode I have not solved: the things you forget first are the things that have not come up recently, which sometimes are unimportant and sometimes are exactly what you needed.

The Python layer and three days of memory boundary errors

The core library is written in Rust. The obvious path to a Python interface is a C FFI layer — compile with cdylib and staticlib targets, write Python bindings using ctypes.

Three days of this
Segmentation fault (core dumped)
Process finished with exit code 139
Occurred ~4 minutes into the session. Nondeterministic. Could not reproduce reliably.
Reference counting — reduced frequency, did not eliminate
Explicit free protocol — Python side required to call free before drop. Still leaked.
Custom __del__ methods — better. Not fixed.
Subprocess delegation — Python launches Rust binary, communicates via stdin/stdout JSON. Slower. Fully reliable.

The FFI compile targets are still in Cargo.toml. The Python layer does not use them.

The root cause was ownership. Rust’s borrow checker enforces memory ownership at compile time. When you expose a Rust type across an FFI boundary, you hand a raw pointer to a runtime that knows nothing about Rust’s ownership rules. Python’s garbage collector freed Python objects that held references to Rust allocations that Rust still owned. The result was segfaults that occurred minutes into sessions, nondeterministic in timing, essentially impossible to reproduce.

Subprocess delegation solved it completely. Each memory operation is a message exchange: the Python side sends a JSON command, the Rust process reads it, executes, writes a JSON response. Slower than direct FFI. For cognitive event workloads, the overhead is invisible.
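
From the Python side, the exchange looks roughly like this. The binary path and the command names are hypothetical; the point is the shape: one JSON line out, one JSON line back, and no pointers ever cross the boundary.

```python
import json
import subprocess

class MemoryClient:
    """Talks to the Rust process over stdin/stdout, one JSON line per call."""

    def __init__(self, binary: str = "./agentic-memory"):  # hypothetical path
        self.proc = subprocess.Popen(
            [binary], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def call(self, command: str, **params) -> dict:
        # Only serialized messages cross the boundary, so Rust keeps
        # full ownership of its allocations.
        self.proc.stdin.write(json.dumps({"cmd": command, "params": params}) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())
```

Each call costs a round trip through two pipes and a JSON parse, which is why it loses to FFI on microbenchmarks and does not matter at cognitive-event rates.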

What is not built yet

Current build status
Binary .amem format, header, node/edge records, LZ4 content block
DONE
6 EventTypes, 7 EdgeTypes, Supersedes correction chain
DONE
WriteEngine, QueryEngine, decay formula, Python subprocess layer
DONE
128-dim feature vector block — format slot exists, embedding model not wired
IN PROGRESS
/forget command — design clear, implementation pending
PENDING
compress_session drops PartOf edges for sessions over 200 events — known bug; needs dynamic buffer allocation
BUG

What this actually means for how long an agent can remember you

I want to be direct about something that does not get said clearly enough in conversations about AI memory.

Right now, when you open a new chat with Claude or GPT, it does not know who you are. It does not know what you talked about yesterday, what you have been building for the last six months, what you corrected it on last week, or how your thinking has evolved. Every session starts from zero. That is not a limitation of the model. It is a limitation of not having a proper place to store memory.

AgenticMemory is that place.

How much space does a year of agent memory actually take?
1 session
~16 KB
smaller than a short email
1 year daily
~11 MB
smaller than 2 Spotify songs
5 years daily
~60 MB
smaller than the apps on your phone
age 5 → age 20
~165 MB
fits on a USB stick

The bottleneck has never been storage. It has been that no one built the right structure for that storage to be useful.

Not just what you said. What the agent decided because of what you said. What it learned from your corrections and never got wrong again. The causal chain from a conversation you had in January to a decision the agent made in September because of what you established in January.

People hear “remember everything forever” and think it requires a data center. It requires a USB stick.

The decay model means the agent is not burdened by noise. Old facts that stopped being relevant fade naturally. Decisions and corrections that you keep referencing stay strong. The memory behaves more like how you actually work — the things that matter surface, the things that do not matter quiet down.

And here is what makes this portable in a way that nothing current is: the .amem file does not belong to Claude, or GPT, or any specific provider. It belongs to you. Start with one model, switch to another — every agent picks up the same brain file and knows everything the previous ones learned about you. Your history with an agent should not be trapped inside the provider’s servers. It should live with you, travel with you, and survive every model upgrade and platform switch that will happen over the next decade.

The memory problem in AI agents is not a storage problem. It is a structure problem. Everyone who tried to solve it reached for a bigger whiteboard. What it actually needed was a proper filing system, with relationships between the files, a record of how each file was used, and a way to trace back through every decision to the facts that grounded it.

That is what this is. And it fits on a USB stick.


The memory format and query engine are in active development. More on the embedding integration when there is something worth writing about.
