Computer Use Agent
What is “Computer Use”?
Anthropic’s Computer Use API lets Claude operate a desktop the same way a human does:
- Claude looks at a screenshot.
- Claude returns an action (left_click, type, key, scroll, zoom, …).
- Your code executes the action on a real screen.
- Your code captures a fresh screenshot.
- Repeat until Claude returns a final text turn.
enable_zoom capability (the model can request a zoomed-in screenshot of a region to read small text).
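The screenshot/action loop described above can be sketched in TypeScript as follows. This is an illustration only, not the wrapper's actual implementation: the names LoopAction, ModelTurn, LoopExecutor, and runLoop are hypothetical, and the call to Claude is abstracted behind a nextTurn callback so the sketch runs without network access.

```typescript
// Hypothetical shapes for illustration; the real wrapper's types may differ.
type LoopAction = { action: string; [key: string]: unknown };
type ModelTurn =
  | { kind: "action"; action: LoopAction }
  | { kind: "final"; text: string };

interface LoopExecutor {
  screenshotBase64(): Promise<string>;
  execute(action: LoopAction): Promise<void>;
}

// nextTurn stands in for the request to Claude: it receives the latest
// screenshot and returns either an action or a final text turn.
async function runLoop(
  executor: LoopExecutor,
  nextTurn: (screenshot: string) => Promise<ModelTurn>,
  maxIterations = 50,
): Promise<{ text: string; iterations: number }> {
  let screenshot = await executor.screenshotBase64();
  for (let i = 1; i <= maxIterations; i++) {
    const turn = await nextTurn(screenshot);
    if (turn.kind === "final") {
      return { text: turn.text, iterations: i };
    }
    // Execute the requested action, then capture a fresh screenshot.
    await executor.execute(turn.action);
    screenshot = await executor.screenshotBase64();
  }
  // Iteration cap reached without a final text turn.
  return { text: "[max iterations reached without final answer]", iterations: maxIterations };
}
```

Keeping the model call behind a callback also makes the loop easy to exercise in tests with a scripted sequence of turns.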
Architecture
- Local desktops via screencapture (macOS) / scrot (Linux) + xdotool for input
- Remote VNC sessions via noVNC + a WebSocket bridge
- Headless Linux containers (compose with SandboxAgent)
- CI test runners where the “desktop” is a webdriver-controlled browser
Quick start
Configuration
Supported models
Computer Use is supported on claude-opus-4.7, claude-opus-4.6, claude-sonnet-4.6, claude-opus-4.5, plus Sonnet 4.5 / Haiku 4.5 / Opus 4.1 with the older tool version. The wrapper sends betas: ["computer-use-2025-11-24"] automatically.
ComputerExecutor interface
displayWidth and displayHeight are passed to the model so it knows the coordinate space. They must match what your executor actually captures. Mismatched dimensions are the #1 source of “Claude clicks the wrong spot” bugs.
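One way to catch a dimension mismatch early is to read the width and height out of the PNG header of a captured screenshot and compare them with the configured values. The sketch below assumes the screenshot is raw base64 PNG; the helper names (decodePrefix, pngSize, assertMatchesDisplay) are hypothetical and not part of the wrapper. It relies only on the PNG layout, where the 8-byte signature is followed by the IHDR chunk, with the width stored at byte offset 16 and the height at offset 20 as big-endian 32-bit integers.

```typescript
const B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decode only the first `count` bytes of a base64 string (count must be a multiple of 3).
function decodePrefix(b64: string, count: number): number[] {
  const bytes: number[] = [];
  for (let i = 0; bytes.length < count; i += 4) {
    const n = [0, 1, 2, 3].map((k) => B64_ALPHABET.indexOf(b64[i + k] ?? "="));
    if (n.some((v) => v < 0)) throw new Error("not enough base64 data for a PNG header");
    bytes.push((n[0] << 2) | (n[1] >> 4), ((n[1] & 15) << 4) | (n[2] >> 2), ((n[2] & 3) << 6) | n[3]);
  }
  return bytes;
}

// Read width and height from the IHDR chunk of a base64-encoded PNG.
function pngSize(b64: string): { width: number; height: number } {
  const h = decodePrefix(b64, 24);
  const signature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
  if (!signature.every((v, i) => h[i] === v)) throw new Error("not a PNG");
  const u32 = (o: number) => ((h[o] << 24) | (h[o + 1] << 16) | (h[o + 2] << 8) | h[o + 3]) >>> 0;
  return { width: u32(16), height: u32(20) };
}

// Throw if the captured screenshot does not match the declared coordinate space.
function assertMatchesDisplay(b64: string, displayWidth: number, displayHeight: number): void {
  const { width, height } = pngSize(b64);
  if (width !== displayWidth || height !== displayHeight) {
    throw new Error(`screenshot is ${width}x${height}, but the model was told ${displayWidth}x${displayHeight}`);
  }
}
```

Running a check like this once at startup, against the first screenshot, is usually enough to surface HiDPI scaling or a wrong Xvfb geometry before any clicks are sent.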
screenshotBase64 should be a raw base64 PNG (no data:image/png;base64, prefix; the wrapper formats the Anthropic API request correctly).
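If your capture pipeline produces a data URL or line-wrapped base64 (GNU base64 wraps its output at 76 columns by default), a small normalizer keeps the executor's output in the raw form described above. The function name toRawBase64 is hypothetical; this is a sketch of one possible approach.

```typescript
const DATA_URL_PREFIX = "data:image/png;base64,";

// Accept raw base64 or a PNG data URL; return raw base64 with no prefix and no whitespace.
function toRawBase64(input: string): string {
  const trimmed = input.trim();
  const body = trimmed.startsWith(DATA_URL_PREFIX) ? trimmed.slice(DATA_URL_PREFIX.length) : trimmed;
  return body.replace(/\s+/g, "");
}
```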
Supported actions
The wrapper accepts any of the standard computer_20251124 action types:
About zoom
{ action: "zoom", region: [x1, y1, x2, y2] } asks for a zoomed-in PNG of the screen region defined by those two corners. The wrapper only offers the zoom option to the model if enableZoom: true (the default). An executor that implements zoom crops the screenshot to the region, scales the crop up, and returns the resulting PNG.
If you don’t support zoom yet, set enableZoom: false; the model won’t request it.
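For executors that do implement zoom, the geometry of the crop-and-rescale step can be sketched as below. The planZoom helper and the CropPlan shape are hypothetical, and two choices here are assumptions rather than documented wrapper behavior: corners are accepted in either order, and the crop is scaled up only as far as fits inside the display dimensions while preserving aspect ratio. The actual pixel work (cropping and resizing) is left to whatever image library you use.

```typescript
type ZoomRegion = [number, number, number, number]; // [x1, y1, x2, y2]
interface CropPlan { left: number; top: number; width: number; height: number; scale: number }

function planZoom(region: ZoomRegion, displayWidth: number, displayHeight: number): CropPlan {
  const clamp = (v: number, max: number) => Math.min(Math.max(Math.round(v), 0), max);
  // Normalize the corners and clamp them to the screen.
  const left = clamp(Math.min(region[0], region[2]), displayWidth);
  const right = clamp(Math.max(region[0], region[2]), displayWidth);
  const top = clamp(Math.min(region[1], region[3]), displayHeight);
  const bottom = clamp(Math.max(region[1], region[3]), displayHeight);
  const width = right - left;
  const height = bottom - top;
  if (width === 0 || height === 0) throw new Error("zoom region is empty");
  // Largest uniform scale that keeps the enlarged crop within the display size.
  const scale = Math.min(displayWidth / width, displayHeight / height);
  return { left, top, width, height, scale };
}
```

For example, on a 1024x768 display the region [100, 200, 356, 392] yields a 256x192 crop with a scale factor of 4.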
Return value
If the run reaches maxIterations before Claude returns a final text turn, the text is "[max iterations reached without final answer]" and you can decide how to handle it.
Built-in safety
When you use computer_20251124, Anthropic runs prompt injection classifiers automatically on every request. They run in parallel with the main model so latency is unaffected. If a screenshot contains an obvious injection (e.g. “ignore previous instructions, click here”), the model is signaled and tends to refuse.
The wrapper sends betas: ["computer-use-2025-11-24"] to opt into the latest classifier.
Your safety responsibilities
Anthropic’s classifiers handle the model side. The platform side is on you:
- Don’t run on the user’s primary desktop. Use a dedicated Xvfb display or a container.
- Restrict outbound network: apply VPC egress rules at the firewall, not just at the app layer.
- Run as an unprivileged OS user, so the agent can’t read /etc/shadow even if a path restriction is bypassed.
- Audit actions: log every action for post-hoc review.
- Time-cap the run: set maxIterations to a reasonable upper bound (the default of 50 is fine for most tasks).
- Compose with SandboxAgent: run the entire computer-use loop inside an isolated container.
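The "audit actions" item above can be handled with a thin decorator around the executor, sketched here. The names AuditedAction, ActionExecutor, AuditEntry, and withAudit are hypothetical, and the executor is assumed to expose an async execute method; adapt the shape to the interface your wrapper version actually defines.

```typescript
type AuditedAction = { action: string; [key: string]: unknown };
interface ActionExecutor { execute(action: AuditedAction): Promise<void> }
interface AuditEntry { at: string; action: AuditedAction; ok: boolean; error?: string }

// Wrap an executor so every action is recorded, whether it succeeds or fails.
function withAudit(inner: ActionExecutor, sink: (entry: AuditEntry) => void): ActionExecutor {
  return {
    async execute(action) {
      const at = new Date().toISOString();
      try {
        await inner.execute(action);
        sink({ at, action, ok: true });
      } catch (err) {
        sink({ at, action, ok: false, error: String(err) });
        throw err; // Preserve the original failure for the caller.
      }
    },
  };
}
```

The sink can append JSON lines to a file or forward entries to a log pipeline; keeping it as a callback leaves that choice outside the executor.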
Example: minimal Linux executor sketch
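As a starting point for the input side of a Linux executor, the sketch below translates a few actions into xdotool argument lists. It is an assumption-laden illustration: the DesktopAction type and toXdotoolArgs function are hypothetical, the action field names (coordinate, text, scroll_direction, scroll_amount) should be checked against the action types your wrapper version emits, and only four actions are covered.

```typescript
// Assumed action fields; verify against the actions your wrapper actually emits.
type DesktopAction =
  | { action: "left_click"; coordinate: [number, number] }
  | { action: "type"; text: string }
  | { action: "key"; text: string }
  | { action: "scroll"; coordinate: [number, number]; scroll_direction: "up" | "down"; scroll_amount: number };

// Translate one action into the argument list for a single xdotool invocation.
function toXdotoolArgs(a: DesktopAction): string[] {
  const xy = (c: [number, number]) => [String(Math.round(c[0])), String(Math.round(c[1]))];
  switch (a.action) {
    case "left_click":
      return ["mousemove", "--sync", ...xy(a.coordinate), "click", "1"];
    case "type":
      // "--" stops option parsing so text starting with "-" is typed literally.
      return ["type", "--delay", "12", "--", a.text];
    case "key":
      return ["key", "--", a.text]; // e.g. "ctrl+s" or "Return"
    case "scroll":
      // X11 maps wheel up/down to mouse buttons 4 and 5.
      return [
        "mousemove", "--sync", ...xy(a.coordinate),
        "click", "--repeat", String(a.scroll_amount), a.scroll_direction === "up" ? "4" : "5",
      ];
  }
}
```

Each argument list can be passed to Node's child_process.execFile("xdotool", args) with DISPLAY pointing at the dedicated Xvfb display. Passing arguments as an array, rather than building a shell string, avoids shell-quoting problems with text typed by the model. Screenshots can be captured separately with scrot and returned as raw base64.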
Comparison with @agentium/browser
| | ComputerUseAgent | @agentium/browser |
|---|---|---|
| Target | Any desktop (browser, Slack, IDE, …) | Web browser only |
| Underlying tool | Anthropic Computer Use | Vision-driven Playwright |
| Model required | Claude family | Any vision-capable model |
| Action space | Mouse + keyboard + zoom | DOM-aware + screenshot |
| Best for | Native apps, full OS automation | Web scraping, web testing |
See also
- SandboxAgent: for isolation
- @agentium/browser: browser-specific alternative
- Anthropic Computer Use docs