Philosophy

What Episodic Memory Means for AI Agents

Humans remember experiences, not just facts. Your agent should too.

When you remember a meeting, you don’t recall a JSON object. You remember the moment — the room, the voice, the pause before someone made a key point. Humans have episodic memory. AI agents don’t. That’s about to change.

Two Kinds of Memory

Cognitive science distinguishes between:

Semantic Memory — Facts and concepts

“The capital of France is Paris”
“Water boils at 100°C”
Timeless, context-free, declarative

Episodic Memory — Experienced events

“I remember the meeting where we discussed the budget”
“That call where the client mentioned timeline concerns”
Time-stamped, contextual, experiential

Most AI memory systems are semantic. Vector databases store embeddings of facts. RAG retrieves documents. But agents that perceive need episodic memory. They need to remember what they saw and heard, when it happened, and what the context was.

Why Episodic Matters

Consider these queries:

Query	Memory Type	What’s Needed
“What is our pricing model?”	Semantic	Retrieved from docs
“What did the client say about pricing last Tuesday?”	Episodic	Retrieved from recordings
“How many people attended the meeting?”	Episodic	Visual memory of the event
“What was on screen when they mentioned the deadline?”	Episodic	Multimodal temporal context

Semantic memory can’t answer episodic questions. You need memory of experiences — not just facts.

Video as Natural Episodic Memory

Video is inherently episodic.

Video has several properties that make it a natural fit for episodic memory:

Time-indexed — Every frame has a timestamp
Multi-sensory — Visual + audio together
Contextual — Shows the environment, not just content
Continuous — Captures the flow of events

When you record a meeting, you’re creating episodic memory. The challenge is making it retrievable.
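Time indexing is what makes a recording addressable. As a minimal sketch, assuming a constant frame rate (real files can be variable-rate, and the function names here are illustrative rather than part of any SDK), frame numbers and timestamps convert like this:

```python
from fractions import Fraction


def frame_timestamp(frame_index: int, fps: Fraction) -> float:
    """Time in seconds at which a frame is shown, assuming constant frame rate."""
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    return float(Fraction(frame_index) / fps)


def frame_at(seconds: float, fps: Fraction) -> int:
    """Index of the frame on screen at a given time (truncates toward zero)."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return int(seconds * fps)


fps = Fraction(30)
print(frame_timestamp(900, fps))  # 30.0
print(frame_at(30.0, fps))        # 900
```

Using `Fraction` keeps rates such as 30000/1001 exact until the final conversion to seconds.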

The Memory Problem

Raw recordings aren’t queryable. You can’t ask an MP4 file “what happened?”

Traditional approaches:

Full transcription — Converts audio to text, loses visual context
Frame extraction — Expensive, loses temporal flow
Manual notes — Doesn’t scale, subjective
Just store it — Recording exists but no one can find anything

None of these create true episodic memory. They create archives.

Indexed Episodic Memory

The solution: indexes that understand what happened and when.

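import videodb

# Setup: connect() reads VIDEO_DB_API_KEY from the environment;
# "YOUR_VIDEO_ID" is a placeholder for an already uploaded video
conn = videodb.connect()
coll = conn.get_collection()
video = coll.get_video("YOUR_VIDEO_ID")

# Create episodic memory from a video
video.index_spoken_words()  # What was said
video.index_scenes(prompt="Describe activities and events")  # What happened

# Query episodic memory
results = video.search("budget discussion")

for shot in results.shots:
    print(f"At {shot.start}s: {shot.text}")
    shot.play()  # Relive the moment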

The index is the memory. It captures:

What happened (semantic content)
When it happened (timestamps)
Evidence (playable links)

Ephemeral vs Persistent

Not all perception needs permanent memory.

Ephemeral — Process but don’t store

Real-time event detection
Privacy-sensitive contexts
Temporary sessions

Ephemeral indexing processes the stream without storing the result:

rtstream.index_visuals(
    prompt="Detect safety issues",
    ephemeral=True  # Don't persist
)

Persistent — Store for later recall

Meeting recordings
Training content
Compliance archives

Persistent memory uses standard indexing:

video.index_spoken_words()  # Stored by default

You control what your agent remembers.
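The distinction comes down to whether an observation outlives the moment it was processed. A minimal sketch, assuming observations arrive as `(timestamp, description)` pairs; `process`, `detect`, and `store` are hypothetical names for illustration, not SDK calls:

```python
from typing import Callable, Iterable, Optional

Observation = tuple[float, str]  # (seconds, description)


def process(
    observations: Iterable[Observation],
    detect: Callable[[str], bool],
    store: Optional[list[Observation]] = None,
) -> list[Observation]:
    """Run detection on every observation; keep a copy only if a store is given."""
    alerts = []
    for obs in observations:
        if detect(obs[1]):
            alerts.append(obs)   # react in the moment
        if store is not None:
            store.append(obs)    # persistent mode: remember for later recall
    return alerts


feed = [(1.0, "worker enters bay"), (2.5, "helmet missing"), (4.0, "forklift passes")]


def is_unsafe(description: str) -> bool:
    return "missing" in description


# Ephemeral: alerts fire, nothing is kept
ephemeral_alerts = process(feed, is_unsafe)

# Persistent: same alerts, and every observation is stored
memory: list[Observation] = []
persistent_alerts = process(feed, is_unsafe, store=memory)
```

Both modes see the same input and raise the same alerts; only the persistent call leaves anything behind to search later.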

Desktop as Continuous Input

Desktop capture creates continuous episodic input:

cap = conn.create_capture_session(end_user_id="user_123")

# What the agent "experiences":
# - Screen content (visual)
# - Microphone (spoken)
# - System audio (ambient)

The agent perceives the user’s experience in real time. With indexing, it builds memory. Later:

# Agent recall: the user asks,
# "Remember when I was debugging that error? What file was I looking at?"

results = cap.search("debugging error")
for shot in results.shots:
    shot.play()  # Show the moment

Multi-Session Memory

Episodic memory spans sessions:

# Search across all recordings
results = coll.search("product roadmap discussions")

# Results from any video in the collection
for shot in results.shots:
    print(f"Video: {shot.video_id}, Time: {shot.start}s")
    print(f"Content: {shot.text}")

The agent doesn’t just remember one meeting. It remembers all meetings.
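One way to picture cross-session recall is as a merge of per-recording results into a single ranked list. The sketch below assumes each recording has already produced its own scored hits; the `Hit` type, the scores, and the sample data are invented for illustration and do not describe how collection search is actually implemented.

```python
import heapq
from itertools import chain
from typing import NamedTuple


class Hit(NamedTuple):
    video_id: str
    start: float   # seconds into that video
    text: str
    score: float   # hypothetical relevance score, higher is better


def top_hits(per_video_hits: dict[str, list[Hit]], k: int = 3) -> list[Hit]:
    """Merge hits from every video and keep the k most relevant, best first."""
    all_hits = chain.from_iterable(per_video_hits.values())
    return heapq.nlargest(k, all_hits, key=lambda h: h.score)


# Illustrative sample data only
per_video_hits = {
    "standup_mon": [Hit("standup_mon", 310.0, "Roadmap: ship search in Q3", 0.91)],
    "planning_wed": [
        Hit("planning_wed", 45.0, "Roadmap priorities for next quarter", 0.87),
        Hit("planning_wed", 1200.0, "Lunch order", 0.12),
    ],
    "client_fri": [Hit("client_fri", 640.0, "Client asks about the roadmap", 0.78)],
}

for hit in top_hits(per_video_hits):
    print(f"Video: {hit.video_id}, Time: {hit.start}s, Content: {hit.text}")
```

Because every hit keeps its `video_id` and `start`, the merged list still points back to the exact recording and moment it came from.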

Grounded Answers

Episodic memory enables grounded responses:

“At 14:32 in yesterday’s meeting, Sarah said ‘We need to revisit the enterprise tier pricing.’ Here’s the clip: [play]”

The difference from an ungrounded answer is trust: the claim comes with evidence that anyone can replay and verify.
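A grounded response can be assembled mechanically once a retrieved moment is in hand. This is a sketch under assumptions: the `Evidence` class and the clip link are hypothetical, and the timestamp is treated as an offset into the recording rather than a wall-clock time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    speaker: str
    quote: str
    start: float    # seconds into the recording
    clip_url: str   # hypothetical playable link


def format_offset(seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS once past an hour."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def grounded_answer(evidence: Optional[Evidence]) -> str:
    """Answer only when there is a moment to cite; otherwise say so."""
    if evidence is None:
        return "I have no recording that supports an answer."
    return (
        f"At {format_offset(evidence.start)}, {evidence.speaker} said "
        f'"{evidence.quote}" Clip: {evidence.clip_url}'
    )


clip = Evidence(
    speaker="Sarah",
    quote="We need to revisit the enterprise tier pricing.",
    start=872.0,
    clip_url="https://example.com/clip?start=872",
)
print(grounded_answer(clip))
```

Declining to answer when no evidence is found is a design choice in this sketch: it keeps every statement the agent makes traceable to a clip.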

The Future

The agents we’re building will:

Perceive continuously (screens, mics, cameras)
Index what they perceive (spoken, visual, events)
Remember across sessions (episodic recall)
Answer with evidence (playable proof)

This isn’t science fiction. The architecture exists today.