Desktop agents

Agents that watch the screen with you.

The desktop is the highest-leverage surface AI has ever had access to. We're going to build the next decade of productivity software on top of it — and it starts with an agent that can see what's on your screen.

For the last forty years, every productivity tool you've ever used has been blind. Word processors don't know what you're writing about. IDEs don't watch you debug. Slack has no idea what you just said on the call. The most expensive surface on your computer — the screen itself — has been a complete blind spot.

That's about to change. The first generation of desktop agents is here, and they are nothing like the chatbots that came before them. They don't sit in a sidebar waiting to be summoned. They run in the background, continuously, watching pixels and listening to mic input, building a memory of what you've done and a model of what you need next.

If LLMs were the moment text got cheap, desktop agents are the moment context got cheap. And context, it turns out, is what work actually runs on.

Why now

Three things had to converge at once — and now they have.

Desktop agents needed three developments to land together. Models had to get fast and multimodal enough that watching a screen at 24fps wasn't a fantasy. OS-level capture APIs had to mature on every major platform. And someone had to build the runtime: the part that handles streams, indexes frames, manages memory, and exposes the whole stack as a single tool any agent loop can call.

VideoDB is that runtime. One install gives your agent a native bridge into the operating system, an ingest pipeline that runs at the speed of the OS, and a memory layer that lets it recall anything it watched as a searchable, replayable clip.

"The desktop agent doesn't ask what you want. It already saw."

Three things builders are shipping today

Not demos. Products people are paying for right now.

1. The pair programmer that actually pairs

Copilot in your editor is a good autocomplete. A pair programmer is something else: it watches the whole environment. It sees the architecture diagram you opened in another window. It sees the YouTube deep-dive you queued up about the auth flow you're rewriting. It hears the question your colleague asked in the last call. When you ask "how should I rebuild this?" it answers from the same shared context a human would.

VideoDB ships a reference implementation at video-db/pair-programmer. It uses the screen capture stream as the primary input and threads a recall API into the assistant so any moment from the last hour, day, or week is one query away.

2. The meeting copilot without the bot

The standard meeting assistant joins your call as a third participant. Everyone in the room watches a robot blink in the corner. That model is dying. The new one is local: capture the audio and screen-share on the device the meeting is already happening on, run perception locally, never invite a stranger to your call.

call.md is the open-source reference. Every meeting becomes a markdown document. Every decision in it has a playable clip attached. When someone says "what did we agree about pricing?" the agent answers with the moment, not a paraphrase.
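
To make the "markdown document with a playable clip per decision" idea concrete, here is a minimal sketch in plain Python. It is not the call.md implementation: the `Decision` record, the `render_meeting` helper, and the `(clip MM:SS-MM:SS)` notation are all hypothetical names chosen for illustration, and a real deployment would link to an actual clip rather than print a time span.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical record: one decision plus the clip span that backs it."""
    summary: str
    start_s: float  # clip start, in seconds from the start of the meeting
    end_s: float    # clip end, in seconds from the start of the meeting


def fmt_ts(seconds: float) -> str:
    """Format a second offset as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_meeting(title: str, decisions: list[Decision]) -> str:
    """Render a meeting as markdown, one bullet per decision with its clip span."""
    lines = [f"# {title}", "", "## Decisions", ""]
    for d in decisions:
        lines.append(f"- {d.summary} (clip {fmt_ts(d.start_s)}-{fmt_ts(d.end_s)})")
    return "\n".join(lines) + "\n"


doc = render_meeting(
    "Pricing sync",
    [Decision("Keep the free tier", 754.0, 790.5)],
)
print(doc)
```

Keeping the clip span next to each decision is what lets an agent answer "what did we agree about pricing?" with the moment itself instead of a paraphrase.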

3. The second brain that finally works

People have been trying to build "second brains" for two decades. Every attempt has failed for the same reason: capture is too hard. Nobody wants to take notes, tag emails, transcribe meetings, save the right Slack threads. Desktop agents solve this without asking anyone to change a behavior. The capture happens whether you notice it or not. The cognitive load drops to zero.

Memory becomes a recall API. The agent watching your screen turns into the most reliable note-taker you've ever had — one that remembers everything and forgets only what you tell it to.
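
One way to picture "memory as a recall API" is a small store with three verbs: remember, recall, forget. The sketch below is a toy in-memory version, assuming simple keyword matching. `Moment`, `Memory`, and their methods are hypothetical names for illustration and are not the VideoDB API.

```python
from dataclasses import dataclass


@dataclass
class Moment:
    """Hypothetical memory entry: what was seen or heard, and when."""
    t: float      # seconds since capture started
    source: str   # "screen" or "mic"
    text: str     # transcript line or on-screen text


class Memory:
    """Toy in-memory store with remember / recall / forget."""

    def __init__(self) -> None:
        self._moments: list[Moment] = []

    def remember(self, moment: Moment) -> None:
        self._moments.append(moment)

    def recall(self, query: str, since: float = 0.0) -> list[Moment]:
        """Return moments at or after `since` whose text contains the query."""
        q = query.lower()
        return [m for m in self._moments if m.t >= since and q in m.text.lower()]

    def forget(self, query: str) -> int:
        """Delete every moment matching the query; return how many were removed."""
        q = query.lower()
        before = len(self._moments)
        self._moments = [m for m in self._moments if q not in m.text.lower()]
        return before - len(self._moments)


mem = Memory()
mem.remember(Moment(10.0, "mic", "Let's revisit pricing next week"))
mem.remember(Moment(95.0, "screen", "Pricing page draft v2"))
mem.remember(Moment(120.0, "screen", "Password reset form"))

hits = mem.recall("pricing", since=60.0)   # only the later pricing moment
removed = mem.forget("password")           # user asks the agent to forget this
```

The explicit `forget` call mirrors the promise above: the store keeps everything until the user says otherwise.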

What you actually get

One SDK, a typed event stream, and frames that never touch disk.

Install the native SDK on Mac, Windows, or Linux. One package, a few lines of code:

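import asyncio

# Assumes VideoDB and your own `agent` object are already imported in scope.
async def main():
    # Stream screen + mic continuously
    vdb = VideoDB()
    async with vdb.desktop(screen=True, mic=True):
        async for event in vdb.stream():
            # transcripts, screen events, intents - typed
            agent.handle(event)

asyncio.run(main())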

What you get back is not raw video. It's a stream of typed events: transcripts, screen changes, application focus, recognized intents. Your agent can subscribe to exactly the signals it cares about and ignore the rest. Frames flow through the process without ever touching disk — unless you decide a moment is worth remembering.
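
Subscribing to "exactly the signals it cares about" can be sketched as dispatch on event type. The example below is self-contained and does not use the SDK: the `Transcript`, `ScreenChange`, and `Intent` classes and the `Dispatcher` are hypothetical stand-ins, since the actual event types are not spelled out here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    text: str


@dataclass
class ScreenChange:
    app: str
    summary: str


@dataclass
class Intent:
    name: str


class Dispatcher:
    """Routes each event to the handlers registered for its exact type."""

    def __init__(self) -> None:
        self._handlers: dict[type, list[Callable]] = {}

    def on(self, event_type: type, handler: Callable) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event: object) -> int:
        """Call matching handlers; return how many ran (0 means ignored)."""
        handlers = self._handlers.get(type(event), [])
        for handler in handlers:
            handler(event)
        return len(handlers)


heard: list[str] = []
dispatcher = Dispatcher()
dispatcher.on(Transcript, lambda e: heard.append(e.text))  # subscribe to speech only

events = [
    ScreenChange(app="editor", summary="opened auth.py"),
    Transcript(text="ship it"),
    Intent(name="open_docs"),
]
handled = [dispatcher.dispatch(e) for e in events]
```

An agent that only registers a `Transcript` handler never sees screen or intent events, which keeps its loop small and its context focused.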

Privacy, on by default

Every desktop deployment is ephemeral by default. Frames are processed in-memory and discarded. Persistence is opt-in, per-stream, and can be locked to your own cloud. SOC 2 and HIPAA-ready out of the box.
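
The "ephemeral by default, persistence opt-in" model can be illustrated with a bounded in-memory buffer. This is a sketch of the general pattern, not a description of how VideoDB implements it: `EphemeralBuffer`, `push`, and `keep_last` are hypothetical names, and the `persisted` list stands in for whatever storage you opt into.

```python
from collections import deque


class EphemeralBuffer:
    """Holds only the most recent frames in memory; older ones are dropped."""

    def __init__(self, max_frames: int) -> None:
        # deque(maxlen=...) silently discards the oldest item when full.
        self._frames = deque(maxlen=max_frames)
        # Stand-in for opt-in storage; stays empty unless keep_last is called.
        self.persisted: list[bytes] = []

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)

    def keep_last(self, n: int) -> None:
        """Explicit opt-in: copy the last n buffered frames to persistence."""
        if n <= 0:
            return
        self.persisted.extend(list(self._frames)[-n:])

    def __len__(self) -> int:
        return len(self._frames)


buf = EphemeralBuffer(max_frames=3)
for i in range(5):
    buf.push(f"frame-{i}".encode())

nothing_saved = list(buf.persisted)  # still empty: no opt-in yet
buf.keep_last(2)                     # the user decides this moment matters
```

Nothing leaves the buffer unless `keep_last` is called, so the default path discards every frame once it ages out.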

The category that didn't exist three years ago

The most consequential software category of the next five years.

Desktop agents will be the most consequential product category of the next five years. Not because they're a flashier chatbot, but because they finally close the loop between what software sees a user doing and what it can help them do. That loop has been open since the GUI was invented. VideoDB is the cheapest, fastest way to close it.

The builders shipping on this stack today are building the next Slack, the next Notion, the next Figma. Not because they have better models; everyone has the same models. It's because they have something nobody else has: a backend that actually understands what's happening on the screen.
