Agent Dreams

Offline reflection lets agents find patterns they miss in real time, but building a reliable dreams pass is harder than it sounds.

Harnyss Team
May 10, 2026 · 11 min read

In Part 2 we walked through memory architecture — how to store experience, how to retrieve it well, how to weight it by importance and outcome. With those pieces in place, an agent that completes a task today is meaningfully better-equipped than one that completed the same task in isolation a month ago.

But there's a ceiling to what memory alone can do.

A memory pool, even a well-curated one, is still a pile of discrete observations. Retrieval doesn't surface the patterns that connect them or the contradictions hidden between them. Cosine search returns the entries most similar to the current query. It doesn't synthesize. It doesn't notice that three different memories all point at the same underlying truth. It doesn't reread an old mistake-lesson phrased as hindsight and realize it should now be a forward-looking rule.

For that, you need an explicit reflection step. We call it dreams.

This post is about what reflection looks like, why most production systems skip it, what we found when we built it, and what we're still figuring out.


Where the idea comes from

The clearest predecessor in the literature is the 2023 Stanford paper "Generative Agents: Interactive Simulacra of Human Behavior" (Park et al.). The team built a small simulated town populated by 25 LLM-driven characters and gave each one a memory stream: every observation, every conversation, every event got written down. To stay under the context window, they retrieved by relevance plus recency plus importance (the same three axes we covered in Part 2).

The novel piece was reflection. When an agent's accumulated importance crossed a threshold, the system would pause it and ask: given recent memories, what high-level questions should I ask myself? And given those questions, what insights can I draw? The model would answer, and the answers became new memories — abstract, higher-importance, citing the originals as sources.
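A minimal sketch of that trigger, assuming each memory carries a numeric importance score. The class name, the threshold value, and the reset-on-fire behavior are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ReflectionTrigger:
    """Accumulates importance scores and signals when reflection is due."""

    threshold: float  # hypothetical value; tune per system
    accumulated: float = 0.0

    def observe(self, importance: float) -> bool:
        """Add one memory's importance; return True when the threshold is crossed."""
        self.accumulated += importance
        if self.accumulated >= self.threshold:
            self.accumulated = 0.0  # assumption: start a fresh window after firing
            return True
        return False


trigger = ReflectionTrigger(threshold=10.0)
fired = [trigger.observe(score) for score in (3.0, 4.0, 5.0, 1.0)]
```

The point of gating on accumulated importance rather than elapsed time is that a quiet agent never pays for reflection, while a busy one reflects as soon as it has something worth reflecting on.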

The behavioral effects were striking. Agents who'd had time to reflect started exhibiting traits that hadn't been programmed in: preferences for certain people, plans that spanned multiple days, plausible reactions to surprise events. Reflection turned out to be load-bearing: in the paper's ablation study, agents running without it were rated as less believable than agents running the full architecture.

The same pattern shows up in cognitive science around sleep-dependent memory consolidation. The brain doesn't just store memories during sleep — it actively replays them, strengthens connections between related events, and prunes weak associations. Sleep is when raw experience becomes generalized knowledge. The biological analogy is imperfect, but the functional argument is the same: there's a job that has to happen outside the moment of experience, and skipping it leaves you with worse-organized memories.

Most production agent systems skip it. They store memories. They retrieve memories. They never re-read the corpus and ask what it adds up to.


Why mechanical clustering isn't enough

The first instinct most teams have, when faced with the "memory pool gets too noisy" problem, is to add a clustering job. Group same-category short-term memories. Use the model to summarize each cluster of 3+ into a single consolidated entry. Replace the originals with the summary.

This works. It keeps the pool bounded. It produces summaries that are more useful than the raw entries they replaced.
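The consolidation job described above can be sketched in a few lines. This is a rough illustration, assuming memories are plain dicts with `category` and `text` keys and that `summarize` wraps whatever model call you use; both are hypothetical stand-ins rather than any particular platform's API.

```python
from collections import defaultdict
from typing import Callable


def consolidate(
    memories: list[dict],
    summarize: Callable[[list[str]], str],
    min_cluster: int = 3,
) -> list[dict]:
    """Replace each same-category group of min_cluster+ entries with one summary."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for memory in memories:
        by_category[memory["category"]].append(memory)

    result: list[dict] = []
    for category, group in by_category.items():
        if len(group) >= min_cluster:
            # The originals are discarded; only the summary survives.
            summary = summarize([m["text"] for m in group])
            result.append({"category": category, "text": summary, "consolidated": True})
        else:
            result.extend(group)  # too small to cluster, keep as-is
    return result
```

Note that grouping happens strictly by category key, so nothing in this job can ever see two entries from different categories side by side.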

But mechanical clustering has hard limits:

  • It only operates on similar entries. Patterns that span across categories — say, three observations about brand voice plus two about client preferences plus one about tool reliability that together suggest a deeper truth about the client — never get combined. The clustering job groups by category. The cross-category pattern is invisible to it.
  • It produces summaries, not insights. The output of clustering similar memories is a consolidated description of what they had in common. That's information compression. It's not synthesis. A summary of "three times the agent forgot to save the document at the end of a task" is just "the agent has forgotten to save the document three times" — true, but no more actionable than the originals.
  • It can't resolve contradictions. Two memories that say opposite things both get kept. Clustering doesn't have a notion of which is correct or current. Eventually one wins by recency, but the explicit contradiction never gets called out.
  • It can't upgrade weak lessons. A mistake-lesson saying "I should have included a chart" sits there forever. The forward-looking version ("When summarizing quarterly data for this client, include at least one chart") never gets written, because clustering doesn't know what a forward-looking rule looks like.

What's missing is a step that reads the entire long-term corpus and reasons about it as a whole. Not similar groups. The whole pool. With enough context to notice patterns that span categories, surface contradictions explicitly, and rewrite weak entries into actionable ones.

That's reflection. That's dreams.


What a dreams pass looks like

The basic shape, independent of any particular implementation:

  1. Schedule — reflection isn't free, so it runs on a schedule, not synchronously
  2. Eligibility — only run for agents with enough memories to be worth reflecting on (10+ long-term entries is a reasonable floor)
  3. Corpus load — pull the agent's full long-term memory pool, capped at some reasonable number, sorted by importance
  4. LLM call — send the corpus to the model with a prompt that asks for three structured outputs:
    • Pattern insights — actionable forward-looking observations grounded in 2+ source memories
    • Contradictions — pairs of memories that cannot both be true, with a recommendation on which to keep
    • Lesson upgrades — mistake-lessons rewritten as learned patterns with concrete forward-guidance
  5. Validation — every memory ID the model references must come from the corpus you sent (otherwise it's hallucinated; drop it)
  6. Apply — insert pattern insights, delete the deprecated side of resolved contradictions, rewrite upgraded lessons in place
  7. Mark dreamed — record when the reflection ran so the next pass skips this agent until enough time has passed
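Steps 3 and 4 can be sketched as follows. This is an illustration under stated assumptions: the `Memory` fields, the cap of 200, the JSON key names, and the prompt wording are all hypothetical, and the actual model call is left out because it depends on your provider.

```python
import json
from dataclasses import dataclass


@dataclass
class Memory:
    id: str
    text: str
    importance: float


CORPUS_CAP = 200  # hypothetical; the steps above only say "some reasonable number"
OUTPUT_KEYS = ("pattern_insights", "contradictions", "lesson_upgrades")


def load_corpus(memories: list[Memory], cap: int = CORPUS_CAP) -> list[Memory]:
    """Step 3: highest-importance memories first, truncated to the cap."""
    return sorted(memories, key=lambda m: m.importance, reverse=True)[:cap]


def build_prompt(corpus: list[Memory]) -> str:
    """Step 4: ask for the three structured outputs, citing IDs from the corpus."""
    lines = [f"[{m.id}] (importance {m.importance}) {m.text}" for m in corpus]
    return (
        "Review the memories below and reply with a JSON object containing "
        "exactly these keys: " + ", ".join(OUTPUT_KEYS) + ".\n"
        "Cite memory IDs exactly as given in brackets.\n\n" + "\n".join(lines)
    )


def parse_response(raw: str) -> dict[str, list]:
    """Keep only the expected keys; anything missing or malformed becomes empty."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        return {key: [] for key in OUTPUT_KEYS}
    return {
        key: data[key] if isinstance(data.get(key), list) else []
        for key in OUTPUT_KEYS
    }
```

Printing the memory ID next to every entry in the prompt is what makes step 5 possible: the model has a concrete set of identifiers to cite, and anything outside that set is easy to detect.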

The cost shape: one model call per agent per reflection cycle. A Sonnet-tier model reading a large memory corpus costs in the low tens of cents per agent per pass. At a weekly cadence with a typical agent roster, that is a small line item relative to the value of the insights it produces.


The hard parts

Picking the right cadence

Too often and you're paying for synthesis with no new material to synthesize. Too rarely and patterns rot before they get noticed. Weekly works well in practice, with an eligibility check that ensures the agent has had enough recent activity to reflect on. Heavily-used agents hit eligibility every cycle; dormant ones skip until they've accumulated enough to make reflection worthwhile.
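One way to express that eligibility check, as a sketch. The 10-entry floor and weekly interval come from the text above; the `min_new` activity threshold and the function signature are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_eligible(
    long_term_count: int,
    new_since_last_dream: int,
    last_dreamed_at: Optional[datetime],
    now: datetime,
    min_total: int = 10,  # floor on long-term entries
    min_new: int = 5,  # hypothetical: minimum fresh material to reflect on
    min_interval: timedelta = timedelta(days=7),
) -> bool:
    """Return True when an agent has enough material and hasn't dreamed recently."""
    if long_term_count < min_total:
        return False
    if last_dreamed_at is not None and now - last_dreamed_at < min_interval:
        return False
    return new_since_last_dream >= min_new
```

With a check like this, a dormant agent is simply skipped each cycle until its new-memory count catches up, at no model cost.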

Picking the right model

Reflection is expensive enough that you don't want it on the most expensive tier by default. It's also too important to do on the cheapest tier — you want a model that can actually notice cross-category patterns and write coherent forward-guidance. Sonnet-tier is the sweet spot for the reflection task: smart enough to reason across the full corpus, cheap enough to run weekly without the budget showing up in dashboards.

Avoiding hallucinated patterns

The single biggest risk of reflection is the model inventing a pattern that doesn't exist in the corpus. An LLM asked "what patterns do you see across these memories?" will produce something — even if there's no real pattern. The output looks plausible, gets written as a high-importance long-term memory, and now the agent's pool is contaminated with a fabricated "lesson" that future reflection passes will themselves cite.

Mitigations that matter:

  • Source validation. Every pattern insight must reference 2+ source memories from the corpus. Any reference that can't be matched to a real entry in the corpus gets dropped entirely.
  • Strict output structure. The model operates within a defined schema for what it can produce. Uncategorizable or out-of-scope outputs don't enter the pool.
  • Bounded count. Capping the number of insights per pass forces the model to pick the strongest signals rather than producing a long list of weak ones.
  • Importance floor. Pattern insights are durable entries, so they should only enter the pool if the model's self-rated confidence clears a meaningful threshold. In practice, that self-rating correlates with insight quality.

These don't eliminate the risk. They reduce its rate to where the value generated outweighs the noise introduced.
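Three of those mitigations (source validation, bounded count, importance floor) reduce to a small filter. The sketch below assumes each insight arrives with a list of cited memory IDs and a self-rated confidence between 0 and 1; the field names and the default limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Insight:
    text: str
    source_ids: list[str]
    confidence: float  # model's self-rating, assumed to be in [0, 1]


def filter_insights(
    insights: list[Insight],
    corpus_ids: set[str],
    max_insights: int = 5,  # hypothetical cap per pass
    min_confidence: float = 0.7,  # hypothetical floor
) -> list[Insight]:
    """Drop ungrounded or low-confidence insights, then keep the strongest few."""
    kept = []
    for insight in insights:
        sources = set(insight.source_ids)
        if len(sources) < 2:
            continue  # needs 2+ distinct source memories
        if not sources <= corpus_ids:
            continue  # cites an ID we never sent: treat as hallucinated
        if insight.confidence < min_confidence:
            continue
        kept.append(insight)
    kept.sort(key=lambda i: i.confidence, reverse=True)
    return kept[:max_insights]
```

Enforcing the cap in code as well as in the prompt is a deliberate belt-and-braces choice: the prompt asks the model to be selective, and the filter guarantees the bound even when it isn't.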

What to do with sources

A subtle design choice: when a pattern insight is created from several source memories, do you keep the originals or replace them?

Mechanical clustering replaces them — once you've summarized, the source data is redundant. Dreams keeps them. The reasoning: a pattern insight is a higher-level statement about the sources, not a replacement for them. The sources may still be relevant on their own (a specific past task, a specific observation). And the pattern insight references them, so an operator reviewing the pool can trace the chain back to the original evidence.

The operational benefit is provenance — we can always answer "where did this insight come from?" and audit whether it actually matches the cited evidence.
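Because the sources are kept, tracing an insight back to its evidence is a plain lookup. A minimal sketch, assuming the pool is a dict keyed by memory ID and each insight stores a `source_ids` list (both hypothetical shapes):

```python
def trace_sources(insight: dict, pool: dict[str, dict]) -> tuple[list[dict], list[str]]:
    """Return the cited source memories plus any cited IDs no longer in the pool."""
    found = [pool[sid] for sid in insight["source_ids"] if sid in pool]
    missing = [sid for sid in insight["source_ids"] if sid not in pool]
    return found, missing
```

A non-empty `missing` list is itself a useful audit signal: it means an insight is resting on evidence that has since been deleted.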


What changes after dreams ship

The first thing you notice is that the memory section in dispatched agents starts including pattern-insight entries that are noticeably more polished than the raw task summaries around them. Where a task summary reads "Task: blog_draft. Quality: 85/100. Observation: lead paragraph used a stat-driven hook", a pattern insight from a dreams pass reads "Blog drafts that open with a concrete data point in the first 50 words consistently score 10-15 points higher than those that open with brand framing. Apply when drafting client-facing thought leadership."

The first is information. The second is guidance.

The second thing you notice is that the agent stops repeating mistakes that get reframed as forward-guidance. A mistake-lesson sits in the corpus until something rereads it; once it's rewritten as a learned pattern with a clear "when X, do Y" structure, the next recall surfaces it as actionable advice rather than as hindsight. The behavior change is real and measurable.

The third thing — and this one took us by surprise — is that the shared memory pool starts populating itself. We auto-share the strongest pattern insights from each dreams pass to the workspace pool. With weekly reflection running across an active workspace, the shared pool accumulates the strongest signals from across the agent roster without anyone explicitly choosing to broadcast them. New agents joining the workspace inherit the team's accumulated insights on day one.


How Harnyss does it

The dreams pass runs on a weekly schedule. For each active agent that hasn't reflected recently and has enough long-term memories to make reflection worthwhile, we load the top memories by importance and send the corpus to the model with a structured prompt.

The prompt asks for three things: pattern insights grounded in multiple source memories, contradictions between existing entries with a recommendation on which to keep, and rewrites of weak hindsight-lessons into actionable forward-guidance. The structure is tight — the model operates within a defined schema and a bounded output count that forces it to surface only the strongest signals.

Before anything gets applied, every output is validated. Pattern insights that reference memories not in the corpus are dropped. Rewrites targeting the wrong memory type are dropped. Contradictions with malformed references are dropped. We track how much gets filtered per pass so we can spot calibration drift over time.
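To illustrate the contradiction and rewrite checks along with the per-pass filter count, here is a sketch. It is not the Harnyss implementation: the key names (`keep_id`, `drop_id`, `memory_id`), the `mistake_lesson` type label, and the shape of the model output are assumptions made for the example.

```python
from collections import Counter


def validate_pass(output: dict, corpus_types: dict[str, str]) -> tuple[dict, Counter]:
    """Filter contradictions and lesson upgrades; count what gets dropped.

    corpus_types maps each memory ID sent to the model onto its memory type.
    """

    def known(memory_id: object) -> bool:
        return isinstance(memory_id, str) and memory_id in corpus_types

    kept: dict[str, list] = {"contradictions": [], "lesson_upgrades": []}
    dropped: Counter = Counter()

    for item in output.get("contradictions", []):
        keep_id, drop_id = item.get("keep_id"), item.get("drop_id")
        if known(keep_id) and known(drop_id) and keep_id != drop_id:
            kept["contradictions"].append(item)
        else:
            dropped["contradictions"] += 1  # malformed or unknown reference

    for item in output.get("lesson_upgrades", []):
        target = item.get("memory_id")
        if known(target) and corpus_types[target] == "mistake_lesson":
            kept["lesson_upgrades"].append(item)
        else:
            dropped["lesson_upgrades"] += 1  # wrong memory type or unknown ID

    return kept, dropped


def filter_rate(kept: dict, dropped: Counter) -> float:
    """Share of outputs dropped in this pass; 0.0 when nothing was produced."""
    total = sum(len(items) for items in kept.values()) + sum(dropped.values())
    return sum(dropped.values()) / total if total else 0.0
```

Logging `filter_rate` per pass gives a single number to watch: a sustained rise suggests the prompt or the model has drifted out of calibration.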

The strongest pattern insights auto-propagate to the workspace shared pool, with a deduplication check to prevent near-identical broadcasts from accumulating. The result: the workspace pool fills itself with the most durable, cross-validated insights the agent roster has collectively developed — without anyone manually curating it.
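The post doesn't specify how the deduplication check works, so the following is only one plausible shape for it: a character-level similarity test using the standard library's `difflib`. The 0.9 threshold and the function names are assumptions, and a production system might well compare embeddings instead.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def is_near_duplicate(candidate: str, existing: list[str], threshold: float = 0.9) -> bool:
    """True if the candidate is at least `threshold` similar to any pooled insight."""
    norm = _normalize(candidate)
    return any(
        SequenceMatcher(None, norm, _normalize(other)).ratio() >= threshold
        for other in existing
    )


def share_insight(candidate: str, shared_pool: list[str], threshold: float = 0.9) -> bool:
    """Append the insight to the shared pool unless a near-duplicate already exists."""
    if is_near_duplicate(candidate, shared_pool, threshold):
        return False
    shared_pool.append(candidate)
    return True
```

String similarity catches reworded copies of the same sentence but not two differently phrased statements of the same idea, which is the main reason to consider an embedding-based check as the pool grows.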

After each pass, the agent is marked as having dreamed and excluded from the next cycle until enough time has elapsed.

The full run for a typical workspace takes a few minutes and costs a couple of dollars. It goes through the same model routing and budget tracking as every other LLM call in the platform — no special accounting, no hidden costs.


The bottom line

Memory plus reflection is meaningfully different from memory alone. An agent that accumulates experience but never synthesizes it is a better-equipped version of a blank slate — useful, but not compounding. An agent that reflects turns that experience into something more durable: forward-looking rules, resolved contradictions, cross-validated patterns that hold across tasks and time.

Most agent platforms don't do this. They have memory. They don't have reflection. The gap shows — in agents that repeat the same mistakes, miss the patterns in their own work, and never get meaningfully smarter the longer they run.

The frozen model is the same. The harness gets sharper, week by week.


Part 1: How Agents Learn — the overview that frames memory and reflection in the broader picture.

Part 2: How Agents Have Memory — the deep dive on memory architecture, vector search, and recall ranking.
