Why Memory Is the Hard Problem
An AI agent without memory is just a chatbot that calls tools. The ability to accumulate, organize, and retrieve information across time is what separates useful agents from impressive demos. But memory isn't one thing. Cognitive science has long distinguished between types of memory that serve different purposes, and those distinctions map onto practical engineering decisions in agent systems. Getting this wrong is one of the most common reasons we see agents fail in production. The four patterns we use are drawn from cognitive architecture research but grounded in what actually ships: working memory, episodic memory, semantic memory, and procedural memory. Each has a distinct role, distinct implementation, and distinct failure mode.
Working Memory: The Active Context
Working memory is what the agent holds in mind right now: the current task state, the immediate conversation history, the results of recent tool calls. In LLM-based agents, this is literally the context window. The engineering challenge with working memory is overflow. Context windows have grown enormously, but agent tasks can generate more state than fits. We've seen production agents that fill their context with tool outputs and can no longer fit the system prompt that tells them what to do. The solution is aggressive working memory management: summarization of older turns, structured state objects that compress information, and explicit "working memory limits" that trigger summarization before context overflow. We treat the context window as a finite resource and manage it like memory in a constrained system.
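As a rough illustration of a working memory limit that triggers summarization before overflow, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `WorkingMemory` class, the 4-characters-per-token estimate, the 80% soft limit, and `naive_summarize`, which stands in for whatever LLM summarization call a real agent would make.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer (assumption: ~4 characters per token).
    return max(1, len(text) // 4)


def naive_summarize(previous: str, turns: List[str], max_chars: int = 160) -> str:
    # Placeholder for an LLM summarization call: clip each turn, then cap the
    # total length so the summary itself cannot grow without bound.
    pieces = ([previous] if previous else []) + [t[:40] for t in turns]
    return " | ".join(pieces)[-max_chars:]


@dataclass
class WorkingMemory:
    system_prompt: str
    budget_tokens: int                 # hard context limit
    soft_limit_ratio: float = 0.8      # compact before the hard limit is reached
    keep_recent: int = 2               # newest turns are kept verbatim
    summarize: Callable[[str, List[str]], str] = naive_summarize
    summary: str = ""
    turns: List[str] = field(default_factory=list)

    def used_tokens(self) -> int:
        parts = [self.system_prompt, self.summary, *self.turns]
        return sum(estimate_tokens(p) for p in parts if p)

    def add_turn(self, text: str) -> None:
        self.turns.append(text)
        if self.used_tokens() > self.budget_tokens * self.soft_limit_ratio:
            self._compact()

    def _compact(self) -> None:
        # Fold everything except the most recent turns into the running summary.
        cut = max(0, len(self.turns) - self.keep_recent)
        old, recent = self.turns[:cut], self.turns[cut:]
        if old:
            self.summary = self.summarize(self.summary, old)
            self.turns = recent

    def render(self) -> List[str]:
        # The system prompt is always emitted first, so tool output cannot crowd it out.
        summary = [f"Summary of earlier turns: {self.summary}"] if self.summary else []
        return [self.system_prompt, *summary, *self.turns]
```

The design choice worth noting is the soft limit: compaction runs while there is still headroom, so a single large tool output is less likely to push the context past the hard budget between checks.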
Episodic Memory: What Happened Before
Episodic memory stores records of past experiences: past conversations, past task executions, past errors. It's the difference between an agent that treats every session as fresh and one that can say "last time you asked me to do this, I found that approach X failed." Implementation is usually a vector database with timestamped entries and retrieval by semantic similarity. The agent writes summaries of completed tasks, key decisions, and errors. On new tasks, it retrieves relevant episodes and includes them in context. The critical design decision is what to store and when. Storing everything creates retrieval noise. Storing too little loses signal. We use a salience filter: only store episodes where something unexpected happened, a constraint was violated, or the user provided explicit feedback. This keeps the episodic store dense with signal rather than diluted with routine completions.
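The salience filter and similarity-based retrieval described above might look something like the following Python sketch. The names (`Episode`, `EpisodicStore`, `is_salient`) are hypothetical, and the bag-of-words `embed` function is only a toy stand-in for an embedding model and vector database, chosen so the example runs with the standard library alone.

```python
import math
import time
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Episode:
    summary: str
    timestamp: float
    unexpected: bool = False            # something surprising happened
    constraint_violated: bool = False
    user_feedback: Optional[str] = None


def is_salient(ep: Episode) -> bool:
    # Salience filter: keep only surprises, violations, or explicit feedback.
    return ep.unexpected or ep.constraint_violated or ep.user_feedback is not None


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class EpisodicStore:
    episodes: List[Episode] = field(default_factory=list)

    def record(self, ep: Episode) -> bool:
        if not is_salient(ep):
            return False  # routine completion: skipped to avoid retrieval noise
        self.episodes.append(ep)
        return True

    def retrieve(self, query: str, k: int = 3) -> List[Episode]:
        q = embed(query)
        scored = [(cosine(q, embed(ep.summary)), ep) for ep in self.episodes]
        scored = [(s, ep) for s, ep in scored if s > 0]   # drop unrelated episodes
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ep for _, ep in scored[:k]]
```

A short usage example, with made-up episode text:

```python
store = EpisodicStore()
store.record(Episode("exported report to csv without issues", time.time()))          # not stored
store.record(Episode("bulk refund via approach X failed with timeout", time.time(),
                     unexpected=True))                                                # stored
for ep in store.retrieve("process a bulk refund"):
    print(ep.summary)
```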
"The four memory types aren't abstractions; they map directly onto engineering decisions about storage, retrieval, and context management in production agent systems."
Semantic Memory: What the Agent Knows
Semantic memory is the agent's general knowledge base: facts, relationships, and domain knowledge that aren't tied to specific episodes. This is where RAG fits as a memory system. For most production agents, semantic memory is a vector index of documents, policies, product information, or domain knowledge. The agent retrieves relevant chunks as needed rather than pre-loading everything. The design challenge is keeping semantic memory fresh. Documents update. Policies change. A semantic memory that's stale is worse than no semantic memory, because it confidently provides outdated information. We build write paths into every semantic memory system: scheduled re-indexing, event-driven updates on source changes, and staleness metadata that the agent can reason about.
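To make the write path and staleness metadata concrete, here is a small Python sketch. It deliberately leaves out vector search and looks documents up by ID, so that the freshness logic is the only thing on display. `SemanticIndex`, `upsert`, `lookup`, and the 30-day `max_age` used in the usage example are all illustrative assumptions, not a description of any particular library.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass
class IndexedDoc:
    doc_id: str
    text: str
    content_hash: str
    indexed_at: datetime


class SemanticIndex:
    def __init__(self, max_age: timedelta) -> None:
        self.max_age = max_age
        self.docs: Dict[str, IndexedDoc] = {}

    def upsert(self, doc_id: str, text: str, now: Optional[datetime] = None) -> bool:
        """Write path: call from a scheduled re-index or a source-change event.

        Returns True if the stored content actually changed.
        """
        if now is None:
            now = datetime.now(timezone.utc)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        existing = self.docs.get(doc_id)
        changed = existing is None or existing.content_hash != digest
        # indexed_at is refreshed even when the text is unchanged, because the
        # content has just been verified against the source.
        self.docs[doc_id] = IndexedDoc(doc_id, text, digest, now)
        return changed

    def lookup(self, doc_id: str, now: Optional[datetime] = None) -> Optional[dict]:
        doc = self.docs.get(doc_id)
        if doc is None:
            return None
        if now is None:
            now = datetime.now(timezone.utc)
        age = now - doc.indexed_at
        # Staleness metadata travels with the text so the agent can hedge or re-fetch.
        return {"text": doc.text, "age_days": age.days, "stale": age > self.max_age}
```

A usage example with a made-up policy document:

```python
index = SemanticIndex(max_age=timedelta(days=30))
index.upsert("refund-policy", "Refunds within 14 days.")
print(index.lookup("refund-policy"))
```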
Procedural Memory: How to Do Things
Procedural memory encodes skills and procedures: not facts, but methods. How to handle a refund request. How to escalate a support ticket. How to structure a legal brief. In agent systems, procedural memory usually takes the form of few-shot examples, chain-of-thought templates, or fine-tuned model weights. The agent doesn't retrieve facts; it retrieves or embodies a way of approaching a class of problem. The most powerful form is fine-tuned procedural memory: a model that has internalized a procedure through training on examples. This is faster and more reliable than prompting at inference time, but requires labeled data and retraining when procedures change. For stable, high-volume procedures, the trade-off is almost always worth it.
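For the prompt-based forms of procedural memory (few-shot examples and step templates), one possible shape is a small library keyed by task class, sketched below in Python. `Procedure`, `ProcedureLibrary`, and `build_prompt` are hypothetical names for this example, and the prompt layout is an arbitrary choice. The fine-tuned variant has no equivalent here, since in that case the procedure lives in the model weights rather than in retrievable text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Procedure:
    name: str
    steps: List[str]                                                # step-by-step template
    examples: List[Tuple[str, str]] = field(default_factory=list)   # (request, ideal response) pairs


class ProcedureLibrary:
    def __init__(self) -> None:
        self._by_task: Dict[str, Procedure] = {}

    def register(self, task_class: str, procedure: Procedure) -> None:
        self._by_task[task_class] = procedure

    def build_prompt(self, task_class: str, request: str, max_examples: int = 2) -> str:
        proc = self._by_task[task_class]  # raises KeyError if no procedure is known
        lines = [f"Procedure: {proc.name}", "Follow these steps:"]
        lines += [f"{i}. {step}" for i, step in enumerate(proc.steps, start=1)]
        # Cap the few-shot examples so procedural memory does not crowd working memory.
        for example_in, example_out in proc.examples[:max_examples]:
            lines += ["", f"Example request: {example_in}", f"Example response: {example_out}"]
        lines += ["", f"Request: {request}", "Response:"]
        return "\n".join(lines)
```

A usage example with invented refund steps:

```python
library = ProcedureLibrary()
library.register("refund", Procedure(
    name="Handle a refund request",
    steps=["Verify the order", "Check the refund window", "Issue or escalate"],
    examples=[("Order arrived broken", "Refund issued.")],
))
print(library.build_prompt("refund", "My order never arrived"))
```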