
How Agents Have Memory

A deep dive into the memory architecture that lets agents retain, retrieve, rank, and forget -- and why forgetting is the hardest part.

Harnyss Team
May 10, 2026 · 12 min read

In Part 1 of this series we framed agent learning as something that happens at the harness layer, not the model layer. The model's weights are frozen. The intelligence that compounds over time lives in the surrounding system — and at the center of that system is memory.

Memory is the single biggest lever for whether an agent feels useful or feels brand new every time you talk to it. This post walks through the design space: where memory lives, how retrieval works, how importance gets scored, and the tradeoffs that decide whether your memory layer helps or actively hurts. Then we'll show how Harnyss approaches each of those decisions.


What "memory" actually means in an agent system

In cognitive science, memory is divided into rough categories:

  • Working memory — the small, short-lived buffer holding what you're attending to right now
  • Episodic memory — specific events, time-stamped, autobiographical ("I had coffee with Sarah on Tuesday")
  • Semantic memory — generalized knowledge stripped of source ("coffee contains caffeine")
  • Procedural memory — how-to knowledge, often non-verbal ("how to ride a bike")

Agent memory systems borrow the first three concepts almost directly. Working memory is the model's context window. Episodic memory is the log of past tasks and conversations. Semantic memory is the distilled, generalized form — the "I learned that this client always wants their reports in a specific format" version of "this happened in the May 3rd report."

The fourth, procedural memory, is interesting: in agent systems, it's often encoded as skills — operator-curated or self-discovered procedures that the agent loads on-demand. Anthropic's Skills are a good example: a folder containing a SKILL.md plus supporting files, loaded when the description matches the current task. We treat skills separately from the memory pool because their lifecycle and source are different — they're written, reviewed, and versioned, not accumulated.

For the rest of this post, "memory" means episodic plus semantic — the things an agent has learned through experience.


Where memory actually lives

Three architectural choices, with very different cost and complexity profiles:

Option A: In-context only

Every relevant fact lives in the system prompt or the conversation history. Nothing persists outside the current session. This is the default. It works fine for one-shot tasks. It collapses the moment you want continuity.

The hard limit is the context window — Claude Opus 4.7 supports 1M tokens at standard pricing, which is enormous, but still finite, and growing the prompt grows the cost on every call. Anthropic's compaction beta mitigates the overflow risk by automatically summarizing earlier turns when the conversation approaches the limit, but compaction summaries lose detail by design. They're not memory; they're a survival mechanism.

Option B: External datastore

Save task summaries, observations, and lessons to durable storage. Retrieve relevant entries when a new task starts and inject them into context. The store can be a SQL table, a vector database, a flat-file system, or some combination.

This is where most production systems land. The reason is simple: storing data is cheap and well-understood; managing context windows is complicated. Pushing memory to durable storage gives you a clean separation — write once, retrieve when relevant, never have to worry about overflow.

Anthropic's memory tool sits here, in the simplest form: the agent reads and writes a directory of files. Smarter setups index the contents — vector embeddings for semantic search, full-text indexes for keyword fallback, importance scores for prioritization.
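As a rough illustration of the external-datastore option, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and the helper names (open_store, write_memory, recall_keyword) are assumptions made for this example, not any product's schema. It shows only the keyword-fallback path; a production setup would add an embedding index alongside it.

```python
import sqlite3
import time

def open_store(path=":memory:"):
    """Open a store and create the (illustrative) memories table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS memories (
               id INTEGER PRIMARY KEY,
               agent_id TEXT NOT NULL,
               content TEXT NOT NULL,
               importance INTEGER NOT NULL DEFAULT 5,
               created_at REAL NOT NULL
           )"""
    )
    return conn

def write_memory(conn, agent_id, content, importance=5):
    """Persist one memory and return its row id."""
    cur = conn.execute(
        "INSERT INTO memories (agent_id, content, importance, created_at) "
        "VALUES (?, ?, ?, ?)",
        (agent_id, content, importance, time.time()),
    )
    conn.commit()
    return cur.lastrowid

def recall_keyword(conn, agent_id, keyword, limit=5):
    """Keyword fallback: substring match, most important first."""
    rows = conn.execute(
        "SELECT content FROM memories "
        "WHERE agent_id = ? AND content LIKE ? "
        "ORDER BY importance DESC, created_at DESC LIMIT ?",
        (agent_id, f"%{keyword}%", limit),
    )
    return [row[0] for row in rows]

conn = open_store()
write_memory(conn, "writer", "Client prefers Oxford commas", importance=8)
write_memory(conn, "writer", "Client said hi on Tuesday", importance=1)
print(recall_keyword(conn, "writer", "Client"))
# prints ['Client prefers Oxford commas', 'Client said hi on Tuesday']
```

Write once, retrieve when relevant: the context window only ever sees what recall_keyword returns, not the whole table.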

Option C: Hybrid — paged context

The most sophisticated option, introduced by MemGPT (Packer et al., 2023): the agent has both an in-context working set and an external memory pool, and it manages the boundary explicitly via tool calls. When the working set fills, the agent decides what to evict; when it needs older context, it queries the pool to page it back in. The OS analogy is intentional.

This is powerful but adds significant complexity. Most teams don't need the paging behavior — recall is cheap enough that retrieving fresh on each turn works fine.
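To make the paging idea concrete, here is a toy sketch. It is not MemGPT's implementation: in MemGPT the model itself chooses what to evict through tool calls, whereas this example substitutes a simple oldest-first policy and a substring search. The class name PagedContext and its methods are hypothetical.

```python
from collections import deque

class PagedContext:
    """Toy working set with an external archive (oldest-first eviction)."""

    def __init__(self, max_items=3):
        self.max_items = max_items
        self.working = deque()  # what would sit in the context window
        self.archive = []       # the external memory pool

    def add(self, item):
        """Add to the working set, evicting the oldest items on overflow."""
        self.working.append(item)
        while len(self.working) > self.max_items:
            self.archive.append(self.working.popleft())

    def page_in(self, keyword):
        """Move archived items that mention the keyword back into context."""
        hits = [m for m in self.archive if keyword.lower() in m.lower()]
        for m in hits:
            self.archive.remove(m)
            self.add(m)
        return hits

ctx = PagedContext(max_items=2)
for note in ["brand voice: plain and direct", "deadline: Friday", "format: PDF"]:
    ctx.add(note)
print(list(ctx.working))  # prints ['deadline: Friday', 'format: PDF']
print(ctx.page_in("brand"))  # prints ['brand voice: plain and direct']
```

Paging an item back in can itself push something else out, which is exactly the bookkeeping that makes this option more complex than retrieving fresh on each turn.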


The retrieval problem

Once memory lives outside the context window, retrieval becomes the hard part. You can't pull everything; the budget is finite. You can't pull randomly; relevance matters. The retrieval ranker is, in many ways, where the actual quality of the memory layer gets decided.

A few axes that matter:

Relevance — usually cosine similarity over embeddings

The default. Embed the query, embed every memory, and return the top-k by cosine similarity (equivalently, the smallest cosine distance). Modern embedding models (OpenAI's text-embedding-3-*, Voyage's voyage-3, Cohere's embed-v3) produce dense vectors of roughly 1024 to 3072 dimensions by default. pgvector and similar extensions make this fast even at hundreds of thousands of entries.

Cosine alone is a starting point, not a finishing point. It tells you which memories are semantically similar to the query — not which are most useful, most recent, most important, or most reliable.
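For reference, a minimal pure-Python sketch of top-k retrieval by cosine similarity. The three-dimensional vectors are toy values standing in for real embeddings, and the function names are illustrative; a real system would call an embedding model and use a vector index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, memories, k=3):
    """Return the k (score, text) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy "embeddings"; real ones come from an embedding model.
memories = [
    ("client prefers Oxford commas", [1.0, 0.0, 0.0]),
    ("reports go out as PDF", [0.6, 0.8, 0.0]),
    ("client said hi on Tuesday", [0.0, 0.0, 1.0]),
]
for score, text in top_k([1.0, 0.0, 0.0], memories, k=2):
    print(round(score, 2), text)
```

Note that the ranking is by highest similarity. If a store exposes cosine distance instead, sort ascending.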

Recency — fresh memories matter more

Two memories with identical cosine similarity should not rank identically if one was written yesterday and one was written 90 days ago. The fresher one is more likely to reflect current reality (brand voice, client preferences, ongoing context), more likely to have been validated, more likely to be load-bearing.

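A simple time-decay factor handles this: recency = 1 / (1 + days_since_last_access / N). N is the age at which the score falls to one half; the decay is hyperbolic, so this is a half-value point rather than a true exponential half-life. Pick N based on how fast your domain shifts. N = 30 days works well for most operational contexts.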
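The decay factor above, written out as a small function with a few worked values. The default of N = 30 follows the text; the function name is illustrative.

```python
def recency(days_since_last_access, n=30.0):
    """Hyperbolic time decay: 1.0 when fresh, 0.5 at n days, 0.25 at 3n days."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    return 1.0 / (1.0 + days_since_last_access / n)

print(recency(0))   # prints 1.0
print(recency(30))  # prints 0.5
print(recency(90))  # prints 0.25
```

With this form the score never reaches zero, so old memories are down-weighted rather than excluded. An exponential decay would fade them faster if that suits the domain better.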

Importance — not all memories are equal

A memory tagged "the client prefers Oxford commas" should not rank with a memory tagged "the client said hi on Tuesday." Importance is the weight that captures this. It can be:

  • Set explicitly when writing the memory (an agent saving a lesson rates its own confidence)
  • Adjusted by access (memories that get retrieved often probably matter)
  • Adjusted by outcome (memories that contributed to successful tasks should accumulate weight)

The third one — outcome-correlated reinforcement — is the most underused. We'll come back to it.

Recall budget

Even with perfect ranking, you have to decide how many memories to inject. Every memory in the prompt costs tokens. Pull too few and you miss relevant context; pull too many and you bloat the prompt, slow the model down, and dilute attention. A reasonable budget is 5-10 memories, truncated to a few hundred characters each, totaling under ~800 tokens.
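One way to enforce such a budget is sketched below. The four-characters-per-token estimate is a rough assumption for English text, not a real tokenizer, and the function names and defaults (10 items, 300 characters, 800 tokens) are illustrative values chosen to match the range described above.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return (len(text) + 3) // 4

def pack_memories(ranked_texts, max_items=10, max_chars=300, token_budget=800):
    """Take memories in rank order, truncate each, stop when the budget is spent."""
    picked, used = [], 0
    for text in ranked_texts[:max_items]:
        snippet = text[:max_chars]
        cost = estimate_tokens(snippet)
        if used + cost > token_budget:
            break
        picked.append(snippet)
        used += cost
    return picked, used

ranked = ["lesson " + "x" * 1000 for _ in range(12)]
picked, used = pack_memories(ranked)
print(len(picked), used)  # prints 10 750
```

Because the input is already ranked, cutting off at the budget drops the least useful memories first.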


Importance scoring — write-time and read-time

There's no canonical way to score importance. The simplest version is a flat scale (1-10) set when the memory is created. Better versions update over time:

  • Promotion — a memory recalled often (high access count) gets promoted from short-term to long-term
  • Outcome correlation — a memory recalled before a high-quality task completion gets its importance bumped
  • Decay — memories not accessed in a long time get demoted or evicted

The key insight: importance is a prediction about future usefulness, and predictions improve when you incorporate signal from actual outcomes. A memory that was theorized to be important when written but never gets recalled probably wasn't. A memory that gets recalled before a task that scores 95/100 almost certainly was.

This creates a compounding dynamic: good memories rank higher, get recalled more often, get recalled into more successful contexts, rank higher still. Not unlike how human expertise compounds — the things that actually prove useful get reinforced, and the noise fades.
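The three update rules (promotion, outcome correlation, decay) can be sketched as small functions over a memory record. Every threshold here is a hypothetical placeholder: promotion after 3 recalls, a score cutoff of 80, a bump of 0.5, and a 60-day idle limit are illustrative, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    importance: float = 5.0  # on the flat 1-10 scale
    access_count: int = 0
    tier: str = "short"      # "short" or "long"
    days_idle: int = 0

def on_recall(memory, promote_after=3):
    """Promotion: count the access and promote frequently recalled memories."""
    memory.access_count += 1
    memory.days_idle = 0
    if memory.tier == "short" and memory.access_count >= promote_after:
        memory.tier = "long"

def on_task_scored(recalled, score, threshold=80, bump=0.5):
    """Outcome correlation: reward memories recalled before a strong result."""
    if score < threshold:
        return
    for memory in recalled:
        memory.importance = min(10.0, memory.importance + bump)

def decay(memory, idle_limit=60, step=1.0):
    """Decay: demote memories that have gone unused for too long."""
    if memory.days_idle > idle_limit:
        memory.importance = max(1.0, memory.importance - step)

m = Memory("client wants data-driven openings")
for _ in range(3):
    on_recall(m)
on_task_scored([m], score=95)
print(m.tier, m.importance)  # prints long 5.5
```

Clamping to the 1-10 range keeps the compounding effect from running away: a memory can reach the top of the scale but not exceed it.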


When to forget

An unbounded memory pool is as useless as no memory. Volume without curation produces noise, slow retrieval, and diluted relevance. The goal is a pool that stays dense with high-signal entries — which means active management of what gets kept.

Forgetting strategies:

  • TTL on short-term entries — anything not promoted before its expiry gets deleted
  • Cap with eviction — when over the hard cap, evict the lowest-importance/oldest entries
  • Consolidation — merge clusters of similar entries into a single higher-level summary, drop the originals
  • Reflection-driven pruning — during a periodic reflection pass, the agent identifies contradictions and drops the deprecated side

The first three are mechanical (expiring entries, evicting by rank, clustering by category and summarizing), and they keep the pool bounded without much intelligence. The reflection-driven version is harder and more interesting; we cover it in Part 3.


Per-agent vs. shared memory

If you're running a multi-agent system, an early architectural question is whether every memory belongs to the agent who wrote it, or whether some memories belong to the workspace.

Per-agent memory is simpler — clean ownership, easy to reason about, no cross-contamination. The downside is knowledge silos: when your CMO agent learns that a particular client always wants data-driven openings, your Content Writer doesn't benefit unless someone explicitly ports the lesson over.

Workspace-scoped shared memory addresses this. It's a real design choice with real tradeoffs:

  • Who can write? Probably not every agent. Junior agents broadcasting their discoveries to the whole org will produce a lot of noise. A reasonable rule: only senior or manager-tier agents, with meaningful quality gates on what gets shared.
  • Dedup matters more. A workspace pool with five near-identical entries about the same brand voice rule is worse than one. Similarity checks on write keep it clean.
  • Cap separately. Treating per-agent and shared pools as independent, capped pools is cleaner than mixing them.

The payoff: when agent A discovers something useful, agents B, C, and D inherit it on their next dispatch. The team gets smarter together rather than independently.
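The three rules above (gated writers, dedup on write, a separate cap) can be combined in one write path. This is a sketch under stated assumptions: the tier names, the 0.9 similarity threshold, the cap of 200, and the SharedPool class are all hypothetical, and the two-dimensional vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SharedPool:
    """Workspace pool with gated writers, dedup on write, and its own cap."""

    def __init__(self, cap=200, dedup_threshold=0.9,
                 writer_tiers=("senior", "manager")):
        self.cap = cap
        self.dedup_threshold = dedup_threshold
        self.writer_tiers = writer_tiers
        self.entries = []  # list of (text, vector)

    def write(self, tier, text, vector):
        """Try to store an entry; return a short status string."""
        if tier not in self.writer_tiers:
            return "rejected: tier"
        if any(cosine(vector, v) >= self.dedup_threshold for _, v in self.entries):
            return "rejected: duplicate"
        if len(self.entries) >= self.cap:
            return "rejected: full"
        self.entries.append((text, vector))
        return "stored"

pool = SharedPool()
print(pool.write("junior", "use short intros", [1.0, 0.0]))      # prints rejected: tier
print(pool.write("senior", "use short intros", [1.0, 0.0]))      # prints stored
print(pool.write("senior", "keep intros brief", [0.99, 0.05]))   # prints rejected: duplicate
```

Rejecting a near-duplicate is the simplest policy. Merging it into the existing entry, or bumping that entry's importance, would be reasonable alternatives.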


How Harnyss approaches each of these

Harnyss runs a multi-agent workspace model — each agent has its own memory, its own mandate, and its own operating history. The design choices below reflect what we found after building and running this in production.

Storage and retrieval

We use vector-based semantic search as the backbone of recall. Every memory is embedded and indexed for fast similarity search. The retrieval ranker blends relevance and recency — semantic similarity dominates, but freshness matters. A recently-validated lesson about a client's preferences outranks an equally-similar observation from six months ago. The top results from each agent's own pool, plus any relevant shared workspace memories, get formatted into the system prompt at dispatch.
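The post does not specify how relevance and recency are blended, so the following is not Harnyss's ranker. It is one common way to get "similarity dominates, freshness matters": a weighted sum with most of the weight on similarity. The 0.8 weight, the 30-day N, and the function name are assumptions for illustration.

```python
def blended_score(similarity, days_old, relevance_weight=0.8, n=30.0):
    """Weighted sum of cosine similarity and hyperbolic recency (both in 0..1)."""
    recency = 1.0 / (1.0 + days_old / n)
    return relevance_weight * similarity + (1.0 - relevance_weight) * recency

# Equal similarity: the fresher memory wins.
fresh = blended_score(similarity=0.8, days_old=1)
stale = blended_score(similarity=0.8, days_old=180)
print(fresh > stale)  # prints True

# Similarity still dominates: a much closer match beats a fresher, weaker one.
close_but_old = blended_score(similarity=0.95, days_old=180)
weak_but_new = blended_score(similarity=0.6, days_old=1)
print(close_but_old > weak_but_new)  # prints True
```

Raising relevance_weight toward 1.0 makes the ranker behave like plain cosine search; lowering it lets recent but loosely related memories crowd out better matches.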

Agents can also query their memory pool mid-task when something specific comes up — a name they half-recognize, a pattern they think they've seen before. That lookup happens as a live tool call rather than through the static system prompt, which keeps the prompt stable for caching purposes.

Two-tier memory: short-term and long-term

Not everything that happens is worth keeping indefinitely. We maintain a two-tier structure: short-term memories are captured broadly and expire if they don't prove useful; long-term memories never expire. Promotion between tiers is automatic — a memory that keeps getting recalled, or that was marked high-importance at write time, earns its place in long-term storage. Everything else clears on a rolling basis.

This keeps the pool bounded and dense. A well-managed memory pool isn't a log — it's a curated set of the things that actually matter.

Outcome-correlated reinforcement

This is the piece that makes the memory layer genuinely self-improving. When a task completes with a strong quality score, the memories that were recalled into context before it started get their importance bumped. They contributed to a good outcome. They should rank higher next time.

The cumulative effect: memories that consistently appear in successful tasks accumulate weight over time. Memories that don't correlate with good outcomes stay put and gradually get displaced by ones that do. The memory pool shapes itself toward what actually works, not just what seemed important when it was written.

Cross-agent sharing

The shared workspace pool is the mechanism by which individual agent learning becomes organizational learning. When an agent in the workspace develops a strong, validated insight — something that held across multiple tasks and has clear general value — it can broadcast that to the workspace pool, where every other agent can recall it.

Write access to the shared pool is gated: only senior-tier agents can write to it, only high-importance entries qualify, and a deduplication check runs on every write to prevent the pool from filling with near-identical variations of the same lesson.

The result is an agent roster where good lessons propagate without operator intervention. One agent's discovery becomes the whole team's context.

What we don't do

We don't expose the raw memory pool in the operator UI. Operators see performance signals and behavioral trends derived from memory, but the pool itself is internal. We may surface it for transparency and editing in the future — for now it operates cleanly under the hood.

We don't sync memory across workspaces. Every tenant's memory is fully isolated. That's non-negotiable.

We don't try to encode procedural memory in the memory pool. Skills handle that. Memory is for episodic and semantic content; procedures are versioned artifacts that need explicit authorship and review.


What's next

Memory gets you most of the way. But a memory pool — even one with vector recall, recency weighting, importance scoring, outcome-correlated reinforcement, and cross-agent sharing — is still a pile of discrete observations. The patterns that connect them, the contradictions that need resolving, the lessons that should be more actionable than they are: those don't surface from retrieval alone. They require an explicit synthesis step.

In Part 3 we cover what happens when an agent gets time to dream — and what we found when we gave each of ours a weekly reflection pass.


Part 1: How Agents Learn — the overview that frames where memory fits in the broader picture.

Part 3: Agent Dreams — what we found when we gave each agent a weekly reflection pass.
