What Episodic Memory Means for AI Agents
Humans remember experiences, not just facts. Your agent should too.
When you remember a meeting, you don’t recall a JSON object. You remember the moment: the room, the voice, the pause before someone made a key point. Humans have episodic memory. Most AI agents don’t. That’s about to change.
Two Kinds of Memory
Cognitive science distinguishes between:
Semantic Memory — Facts and concepts
- “The capital of France is Paris”
- “Water boils at 100°C at sea level”
- Timeless, context-free, declarative
Episodic Memory — Experienced events
- “I remember the meeting where we discussed the budget”
- “That call where the client mentioned timeline concerns”
- Time-stamped, contextual, experiential
Most AI memory systems are semantic. Vector databases store embeddings of facts. RAG retrieves documents. But agents that perceive need episodic memory. They need to remember what they saw and heard, when it happened, and what the context was.
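One way to picture the distinction is as two record types. The sketch below is illustrative only: `SemanticFact`, `Episode`, and all of their fields are hypothetical names chosen for this example, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SemanticFact:
    """A timeless, context-free statement."""
    statement: str


@dataclass
class Episode:
    """An experienced event: what was perceived, when, and in what context."""
    description: str
    start_s: float   # offset into the recording, in seconds
    end_s: float
    source_id: str   # hypothetical identifier of the recording
    context: Dict[str, str] = field(default_factory=dict)

    def covers(self, t: float) -> bool:
        """Return True if time t (in seconds) falls inside this episode."""
        return self.start_s <= t <= self.end_s


fact = SemanticFact("The capital of France is Paris")
episode = Episode(
    description="Client mentions timeline concerns",
    start_s=754.0,
    end_s=790.5,
    source_id="call_042",
    context={"speaker": "client"},
)
print(episode.covers(760.0))
```

The fact carries no time or source. The episode is only meaningful together with both.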
Why Episodic Matters
Consider these queries:
| Query | Memory Type | What’s Needed |
|---|---|---|
| “What is our pricing model?” | Semantic | Retrieved from docs |
| “What did the client say about pricing last Tuesday?” | Episodic | Retrieved from recordings |
| “How many people attended the meeting?” | Episodic | Visual memory of the event |
| “What was on screen when they mentioned the deadline?” | Episodic | Multimodal temporal context |
Semantic memory can’t answer episodic questions. You need memory of experiences — not just facts.
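An agent therefore has to decide which kind of memory a question needs before retrieving anything. The following is a deliberately crude sketch of that routing step, assuming a hand-written list of cue words. The `EPISODIC_CUES` pattern and the `memory_type` function are hypothetical, and a real system would more likely use a trained classifier or an LLM.

```python
import re

# Hypothetical cue words hinting that a question is about a specific experience.
EPISODIC_CUES = re.compile(
    r"\b(said|say|mentioned|attended|yesterday|last|when|during|on screen)\b",
    re.IGNORECASE,
)


def memory_type(query: str) -> str:
    """Route a query to 'episodic' or 'semantic' memory using keyword cues."""
    return "episodic" if EPISODIC_CUES.search(query) else "semantic"


queries = [
    "What is our pricing model?",
    "What did the client say about pricing last Tuesday?",
    "How many people attended the meeting?",
    "What was on screen when they mentioned the deadline?",
]
for q in queries:
    print(f"{memory_type(q):9} {q}")
```

Keyword cues misfire easily (for example, "What is the last step of onboarding?" would be routed to episodic memory), so treat this as an illustration of the routing idea rather than a usable classifier.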
Video as Natural Episodic Memory
Video is inherently episodic.
Several properties make it well suited to episodic memory:
- Time-indexed — Every frame has a timestamp
- Multi-sensory — Visual + audio together
- Contextual — Shows the environment, not just content
- Continuous — Captures the flow of events
When you record a meeting, you’re creating episodic memory. The challenge is making it retrievable.
The Memory Problem
Raw recordings aren’t queryable. You can’t ask an MP4 file “what happened?”
Traditional approaches:
- Full transcription — Converts audio to text, loses visual context
- Frame extraction — Expensive, loses temporal flow
- Manual notes — Doesn’t scale, subjective
- Just store it — Recording exists but no one can find anything
None of these create true episodic memory. They create archives.
Indexed Episodic Memory
The solution: indexes that understand what happened and when.
# Assumes `video` is a video object already loaded from your collection
# Create episodic memory from a video
video.index_spoken_words()  # What was said
video.index_scenes(prompt="Describe activities and events")  # What happened

# Query episodic memory
results = video.search("budget discussion")
for shot in results.shots:
    print(f"At {shot.start}s: {shot.text}")
    shot.play()  # Relive the moment
The index is the memory. It captures:
- What happened (semantic content)
- When it happened (timestamps)
- Evidence (playable links)
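To make those three parts concrete, here is a toy in-memory index. It is a sketch under stated assumptions, not how any real indexing service works: `Moment`, `EpisodicIndex`, and the example URLs are hypothetical, and plain keyword matching stands in for real semantic search.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Moment:
    text: str        # what happened
    start_s: float   # when it started, in seconds
    end_s: float     # when it ended, in seconds
    clip_url: str    # evidence: hypothetical playable link


class EpisodicIndex:
    def __init__(self) -> None:
        self._moments: List[Moment] = []

    def add(self, moment: Moment) -> None:
        self._moments.append(moment)

    def search(self, query: str) -> List[Moment]:
        """Return moments whose text contains every query word, oldest first."""
        words = query.lower().split()
        hits = [m for m in self._moments if all(w in m.text.lower() for w in words)]
        return sorted(hits, key=lambda m: m.start_s)


index = EpisodicIndex()
index.add(Moment("Budget approved after discussion", 600.0, 630.0, "https://example.com/clip/3"))
index.add(Moment("Team reviews the budget for Q3", 120.0, 185.0, "https://example.com/clip/1"))
index.add(Moment("Client raises timeline concerns", 300.0, 340.0, "https://example.com/clip/2"))

for m in index.search("budget"):
    print(f"At {m.start_s}s: {m.text} ({m.clip_url})")
```

Each hit carries its timestamp and a link back to the source, which is what separates an index from an archive.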
Ephemeral vs Persistent
Not all perception needs permanent memory.
Ephemeral — Process but don’t store
- Real-time event detection
- Privacy-sensitive contexts
- Temporary sessions
Ephemeral indexing processes without persisting:
rtstream.index_visuals(
    prompt="Detect safety issues",
    ephemeral=True,  # Don't persist
)
Persistent — Store for later recall
- Meeting recordings
- Training content
- Compliance archives
Persistent memory uses standard indexing:
video.index_spoken_words() # Stored by default
You control what your agent remembers.
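The same choice can be modeled as a single flag on a memory store. This is a minimal sketch, assuming a hypothetical `MemoryStore` class that mirrors the `ephemeral` switch above. It does not describe how any real service handles retention.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStore:
    persisted: List[str] = field(default_factory=list)

    def observe(self, event: str, ephemeral: bool = False) -> str:
        """Process an event, and keep it only when ephemeral is False."""
        result = f"processed: {event}"
        if not ephemeral:
            self.persisted.append(event)
        return result


store = MemoryStore()
store.observe("forklift enters walkway", ephemeral=True)  # acted on, not stored
store.observe("weekly planning meeting")                  # stored for later recall
print(store.persisted)
```

Both events are processed, but only the second can be recalled later.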
Desktop as Continuous Input
Desktop capture creates continuous episodic input:
cap = conn.create_capture_session(end_user_id="user_123")
# What the agent "experiences":
# - Screen content (visual)
# - Microphone (spoken)
# - System audio (ambient)
The agent perceives the user’s experience in real time. With indexing, it builds memory that can be recalled later:
# Agent recall
# User: "Remember when I was debugging that error? What file was I looking at?"
results = cap.search("debugging error")
for shot in results.shots:
    shot.play()  # Show the moment
Multi-Session Memory
Episodic memory spans sessions:
# Search across all recordings
results = coll.search("product roadmap discussions")

# Results from any video in the collection
for shot in results.shots:
    print(f"Video: {shot.video_id}, Time: {shot.start}s")
    print(f"Content: {shot.text}")
The agent doesn’t just remember one meeting. It remembers all meetings.
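Conceptually, searching a collection is a fan-out: run the query against every recording and tag each hit with where it came from. The sketch below illustrates that idea with plain Python data. `Hit`, `search_sessions`, and the sample sessions are hypothetical, and substring matching again stands in for real search.

```python
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    video_id: str
    start_s: float
    text: str


def search_sessions(
    sessions: Dict[str, List[Tuple[float, str]]], query: str
) -> List[Hit]:
    """Run a keyword query over every session and tag hits with their source."""
    needle = query.lower()
    hits = [
        Hit(video_id, start_s, text)
        for video_id, segments in sessions.items()
        for start_s, text in segments
        if needle in text.lower()
    ]
    # Order by recording, then by time within the recording.
    return sorted(hits, key=lambda h: (h.video_id, h.start_s))


sessions = {
    "meeting_thu": [(95.0, "Roadmap risks and open questions")],
    "meeting_mon": [(30.0, "Kickoff and introductions"), (410.0, "Product roadmap for Q3")],
}
for hit in search_sessions(sessions, "roadmap"):
    print(f"Video: {hit.video_id}, Time: {hit.start_s}s, Content: {hit.text}")
```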
Grounded Answers
Episodic memory enables grounded responses:
“At 14:32 in yesterday’s meeting, Sarah said ‘We need to revisit the enterprise tier pricing.’ Here’s the clip: [play]”
What separates this from an ungrounded answer is trust: episodic memory provides evidence the user can verify.
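Assembling such an answer is mostly formatting once a search hit carries a timestamp and a clip link. Here is a small sketch, assuming the timestamp is an offset in seconds from the start of the recording. `format_offset`, `grounded_answer`, and the example URL are hypothetical.

```python
def format_offset(seconds: float) -> str:
    """Render an offset in seconds as MM:SS (minutes are not wrapped into hours)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def grounded_answer(speaker: str, quote: str, start_s: float, clip_url: str) -> str:
    """Combine a quote with its timestamp and a link to the supporting clip."""
    return f"At {format_offset(start_s)}, {speaker} said '{quote}' Clip: {clip_url}"


answer = grounded_answer(
    speaker="Sarah",
    quote="We need to revisit the enterprise tier pricing.",
    start_s=872.0,
    clip_url="https://example.com/clip/42",
)
print(answer)
```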
The Future
The agents we’re building will:
- Perceive continuously (screens, mics, cameras)
- Index what they perceive (spoken, visual, events)
- Remember across sessions (episodic recall)
- Answer with evidence (playable proof)
This isn’t science fiction. The architecture exists today.