Computer Use Agent

What is “Computer Use”?

Anthropic’s Computer Use API lets Claude operate a desktop the same way a human does:

1. Claude looks at a screenshot.
2. Claude returns an action (left_click, type, key, scroll, zoom, …).
3. Your code executes the action on a real screen.
4. Your code captures a fresh screenshot.
5. Repeat until Claude returns a final text turn.

It’s the closest thing to a general “give the LLM a computer” interface that exists in mid-2026. Agentium ships a small wrapper that runs the loop for you, including built-in support for the November 2025 enable_zoom capability (the model can request a zoomed-in screenshot of a region to read small text).

Architecture

                            ┌────────────────────────────────────┐
                            │       ComputerUseAgent             │
                            │                                    │
   user prompt ────────────▶│  loop:                             │
                            │    1. Anthropic.messages.create     │
                            │    2. iterate tool_use blocks       │
                            │    3. executor.execute(action)      │
                            │    4. append screenshot to context  │
                            │    5. repeat until final text turn  │
                            └────────────────────────────────────┘
                                            │
                                            ▼
                            ┌────────────────────────────────────┐
                            │  ComputerExecutor (you implement)  │
                            │                                    │
                            │   displayWidth, displayHeight       │
                            │   execute(action) -> screenshot     │
                            └────────────────────────────────────┘
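
The loop in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not Agentium's actual implementation: `ModelTurn`, `LoopDeps`, and `runLoop` are hypothetical names, and the model call is injected as a plain function so the sketch runs without network access.

```typescript
// Hypothetical, simplified shapes for illustration only.
type Action = { action: string; [key: string]: unknown };
type ActionResult = { output?: string; screenshotBase64?: string };
type ModelTurn =
  | { kind: "actions"; actions: Action[] } // the model requested tool use
  | { kind: "final"; text: string };       // the model produced a final text turn

interface LoopDeps {
  callModel(history: ActionResult[]): Promise<ModelTurn>; // stands in for messages.create
  execute(action: Action): Promise<ActionResult>;         // stands in for executor.execute
  maxIterations: number;
}

async function runLoop(deps: LoopDeps) {
  const history: ActionResult[] = [];
  const actions: Action[] = [];
  for (let i = 1; i <= deps.maxIterations; i++) {
    const turn = await deps.callModel(history);
    if (turn.kind === "final") {
      return { text: turn.text, actions, iterations: i };
    }
    for (const action of turn.actions) {
      actions.push(action);
      // The fresh screenshot goes back into the context for the next call.
      history.push(await deps.execute(action));
    }
  }
  // Safety cap reached without a final text turn.
  return {
    text: "[max iterations reached without final answer]",
    actions,
    iterations: deps.maxIterations,
  };
}
```

Injecting `callModel` and `execute` keeps the control flow testable without a display or an API key.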

The executor is intentionally abstract so the same agent can drive:

Local desktops via screencapture (macOS) / scrot (Linux) + xdotool for input
Remote VNC sessions via noVNC + a WebSocket bridge
Headless Linux containers (compose with SandboxAgent)
CI test runners where the “desktop” is a webdriver-controlled browser

Quick start

import { ComputerUseAgent, type ComputerExecutor } from "@agentium/core";

const executor: ComputerExecutor = {
  displayWidth: 1920,
  displayHeight: 1080,
  displayNumber: 1, // optional X11 display
  execute: async (action) => {
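    // Implement against your platform. Example for macOS: screencaptureBase64() and
    // cliclick() are helpers you write around the screencapture and cliclick CLIs.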
    switch (action.action) {
      case "screenshot":
        return { screenshotBase64: await screencaptureBase64() };
      case "left_click":
        if (action.coordinate) await cliclick(`c:${action.coordinate.join(",")}`);
        return { screenshotBase64: await screencaptureBase64() };
      // ... handle the other action types
      default:
        return { output: `unhandled: ${action.action}`, screenshotBase64: await screencaptureBase64() };
    }
  },
};

const agent = new ComputerUseAgent({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: "claude-sonnet-4-20250514",
  executor,
  enableZoom: true,
  maxIterations: 50,
  systemPrompt: "You are operating a macOS desktop. Be concise and decisive.",
});

const result = await agent.run("Open Firefox and search for the latest Node.js release.");
console.log(result.text);
console.log(`Iterations used: ${result.iterations}`);
console.log(`Actions taken:    ${result.actions.length}`);

Configuration

interface ComputerUseAgentConfig {
  apiKey?: string;          // defaults to ANTHROPIC_API_KEY env
  model?: string;           // default "claude-sonnet-4-20250514"
  maxTokens?: number;       // default 4096
  executor: ComputerExecutor; // required
  maxIterations?: number;   // default 50 — safety cap on the loop
  systemPrompt?: string;    // optional, prepended to every call
  enableZoom?: boolean;     // default true — sends enable_zoom to computer_20251124
}

Supported models

Computer Use is supported on claude-opus-4.7, claude-opus-4.6, claude-sonnet-4.6, claude-opus-4.5, plus Sonnet 4.5 / Haiku 4.5 / Opus 4.1 with the older tool version. The wrapper sends betas: ["computer-use-2025-11-24"] automatically.

`ComputerExecutor` interface

interface ComputerExecutor {
  readonly displayWidth: number;    // pixel width of the screen
  readonly displayHeight: number;   // pixel height of the screen
  readonly displayNumber?: number;  // X11 display number, if relevant
  execute(action: ComputerAction): Promise<ComputerActionResult>;
}

interface ComputerActionResult {
  output?: string;            // optional human-readable log (errors, etc.)
  screenshotBase64?: string;  // PNG screenshot after the action
}

displayWidth and displayHeight are passed to the model so it knows the coordinate space. They must match what your executor actually captures. Mismatched dimensions are the #1 source of “Claude clicks the wrong spot” bugs. screenshotBase64 should be a raw base64 PNG (no data:image/png;base64, prefix; the wrapper formats the Anthropic API request correctly).
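
One way to catch a dimension mismatch early is to read the size straight out of the PNG header and compare it with what the executor declares. The sketch below is an assumption about how you might do that in your own code; `pngSize` and `assertMatchesDisplay` are hypothetical helpers, not part of the wrapper. It relies only on the PNG layout, in which the IHDR chunk follows the 8-byte signature.

```typescript
import { Buffer } from "node:buffer";

// Reads width and height from the IHDR chunk of a base64-encoded PNG.
function pngSize(screenshotBase64: string): { width: number; height: number } {
  // Tolerate a data-URL prefix in case one slipped in.
  const raw = screenshotBase64.replace(/^data:image\/png;base64,/, "");
  const buf = Buffer.from(raw, "base64");
  const signature = "89504e470d0a1a0a";
  if (buf.length < 24 || buf.subarray(0, 8).toString("hex") !== signature) {
    throw new Error("not a PNG");
  }
  // Layout: 8-byte signature, 4-byte chunk length, 4-byte "IHDR",
  // then width and height as big-endian uint32.
  return { width: buf.readUInt32BE(16), height: buf.readUInt32BE(20) };
}

// Throws when the captured image does not match the declared coordinate space.
function assertMatchesDisplay(
  screenshotBase64: string,
  display: { displayWidth: number; displayHeight: number },
): void {
  const { width, height } = pngSize(screenshotBase64);
  if (width !== display.displayWidth || height !== display.displayHeight) {
    throw new Error(
      `screenshot is ${width}x${height}, executor declares ` +
        `${display.displayWidth}x${display.displayHeight}`,
    );
  }
}
```

Calling `assertMatchesDisplay` on the first screenshot of a run is usually enough. HiDPI displays, which often capture at a multiple of the logical resolution, are a common cause of mismatches.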

Supported actions

The wrapper accepts any of the standard computer_20251124 action types:

type ComputerAction =
  | { action: "screenshot" }
  | { action: "mouse_move"; coordinate: [number, number] }
  | { action: "left_click"; coordinate?: [number, number] }
  | { action: "right_click"; coordinate?: [number, number] }
  | { action: "double_click"; coordinate?: [number, number] }
  | { action: "left_click_drag"; coordinate: [number, number] }
  | { action: "type"; text: string }
  | { action: "key"; text: string }
  | { action: "scroll"; coordinate: [number, number]; scroll_direction: "up" | "down" | "left" | "right"; scroll_amount: number }
  | { action: "zoom"; region: [number, number, number, number] };

The wrapper logs the action shape and hands it to your executor. Your executor decides how to perform it — there is no “default implementation” because the right behavior depends entirely on your platform.
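
As an illustration of what an executor's dispatch might look like on X11, the sketch below maps each action to an `xdotool` command chain. `toXdotoolArgs` is a hypothetical helper, the action union is copied locally so the block is self-contained, and the drag handling assumes the drag starts at the current pointer position because the type above carries only one coordinate.

```typescript
type Point = [number, number];
// Local copy of the action union shown above.
type ComputerAction =
  | { action: "screenshot" }
  | { action: "mouse_move"; coordinate: Point }
  | { action: "left_click"; coordinate?: Point }
  | { action: "right_click"; coordinate?: Point }
  | { action: "double_click"; coordinate?: Point }
  | { action: "left_click_drag"; coordinate: Point }
  | { action: "type"; text: string }
  | { action: "key"; text: string }
  | { action: "scroll"; coordinate: Point; scroll_direction: "up" | "down" | "left" | "right"; scroll_amount: number }
  | { action: "zoom"; region: [number, number, number, number] };

// X11 reports scroll wheel motion as clicks of buttons 4 to 7.
const SCROLL_BUTTON = { up: "4", down: "5", left: "6", right: "7" } as const;

const moveTo = (p?: Point): string[] =>
  p ? ["mousemove", String(p[0]), String(p[1])] : [];

// Returns one xdotool command chain, or null when no input event is needed.
function toXdotoolArgs(a: ComputerAction): string[] | null {
  switch (a.action) {
    case "screenshot":
    case "zoom":
      return null; // handled by the capture path, not by input events
    case "mouse_move":
      return moveTo(a.coordinate);
    case "left_click":
      return [...moveTo(a.coordinate), "click", "1"];
    case "right_click":
      return [...moveTo(a.coordinate), "click", "3"];
    case "double_click":
      return [...moveTo(a.coordinate), "click", "--repeat", "2", "1"];
    case "left_click_drag":
      // Assumption: press at the current pointer position, move, release.
      return ["mousedown", "1", ...moveTo(a.coordinate), "mouseup", "1"];
    case "type":
      return ["type", "--delay", "10", a.text];
    case "key":
      return ["key", a.text];
    case "scroll":
      return [
        ...moveTo(a.coordinate),
        "click", "--repeat", String(a.scroll_amount), SCROLL_BUTTON[a.scroll_direction],
      ];
    default: {
      // Compile-time exhaustiveness check: a new action type fails to type-check here.
      const unreachable: never = a;
      throw new Error(`unhandled action: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

Keeping the mapping a pure function makes it easy to unit-test separately from the process spawning and screenshot capture.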

About `zoom`

{ action: "zoom", region: [x1, y1, x2, y2] } asks for a zoomed-in PNG of the screen region defined by those two corners. The wrapper only includes the zoom tool option if enableZoom: true (default). For executor implementations, this means cropping to the region, rescaling up, and returning the cropped PNG. If you don’t support zoom yet, set enableZoom: false; the model won’t request it.
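
The crop-and-rescale step could be done with ImageMagick. The sketch below only builds the argument list; `zoomArgs` is a hypothetical helper, and scaling the crop up to fit the declared display size is an assumption, since the text above does not fix a target size.

```typescript
type Region = [number, number, number, number]; // [x1, y1, x2, y2]

// Builds ImageMagick arguments that crop `region` out of the screenshot at `src`
// and scale the crop up to fit within the display size, writing to `dst`.
function zoomArgs(
  region: Region,
  displayWidth: number,
  displayHeight: number,
  src: string,
  dst: string,
): string[] {
  const clamp = (v: number, max: number): number =>
    Math.min(Math.max(Math.round(v), 0), max);
  // Accept the corners in either order and clamp them to the screen.
  const x1 = clamp(Math.min(region[0], region[2]), displayWidth);
  const y1 = clamp(Math.min(region[1], region[3]), displayHeight);
  const x2 = clamp(Math.max(region[0], region[2]), displayWidth);
  const y2 = clamp(Math.max(region[1], region[3]), displayHeight);
  const w = x2 - x1;
  const h = y2 - y1;
  if (w === 0 || h === 0) throw new Error("zoom region is empty");
  // Uniform scale so the crop is enlarged without distortion.
  const scale = Math.min(displayWidth / w, displayHeight / h);
  const outW = Math.round(w * scale);
  const outH = Math.round(h * scale);
  return [src, "-crop", `${w}x${h}+${x1}+${y1}`, "+repage", "-resize", `${outW}x${outH}`, dst];
}
```

An executor might pass the result to `convert` (ImageMagick 6) or `magick` (ImageMagick 7) via `execFile`, then read `dst` and return it as `screenshotBase64`.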

Return value

interface ComputerUseRunOutput {
  text: string;                // final assistant text
  actions: ComputerAction[];   // all actions taken during the run
  iterations: number;          // how many LLM round-trips
}

If the loop hits maxIterations before Claude returns a final text turn, the text is "[max iterations reached without final answer]" and you can decide how to handle it.
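
One way to handle that case is to turn the sentinel into an explicit error at the call site. `finalAnswerOrThrow` and the local `RunOutput` shape are hypothetical; only the sentinel string comes from the description above.

```typescript
const MAX_ITERATIONS_SENTINEL = "[max iterations reached without final answer]";

// Minimal local shape mirroring ComputerUseRunOutput.
interface RunOutput {
  text: string;
  actions: unknown[];
  iterations: number;
}

// Returns the final text, or throws if the run was cut off by the iteration cap.
function finalAnswerOrThrow(result: RunOutput): string {
  if (result.text === MAX_ITERATIONS_SENTINEL) {
    throw new Error(
      `run stopped after ${result.iterations} iterations and ` +
        `${result.actions.length} actions without a final answer`,
    );
  }
  return result.text;
}
```

Callers that prefer to retry or raise the cap can catch the error and inspect `result.actions` to see how far the run got.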

Built-in safety

When you use computer_20251124, Anthropic runs prompt injection classifiers automatically on every request. They run in parallel with the main model so latency is unaffected. If a screenshot contains an obvious injection (e.g. “ignore previous instructions, click here”), the model is signaled and tends to refuse. The wrapper sends betas: ["computer-use-2025-11-24"] to opt into the latest classifier.

Your safety responsibilities

Anthropic’s classifiers handle the model side. The platform side is on you:

Don’t run on the user’s primary desktop. Use a dedicated Xvfb display or a container.
Restrict outbound network: enforce VPC egress rules at the firewall, not just at the app layer.
Run as an unprivileged OS user: the process can’t read /etc/shadow even if a path-traversal escape succeeds.
Audit actions: log every action for post-hoc review.
Time-cap the run: set maxIterations to a reasonable upper bound (the default of 50 is fine for most tasks).
Compose with SandboxAgent — run the entire computer-use loop inside an isolated container.
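
Auditing and a wall-clock cap can both live in a thin decorator around the executor. The sketch below is one possible shape under stated assumptions: `withAudit`, `AuditEntry`, and the local `Executor` type are hypothetical, and how `ComputerUseAgent` reacts to an error thrown from `execute` is not specified here.

```typescript
// Minimal local shapes so the sketch is self-contained.
type Action = { action: string; [key: string]: unknown };
type Result = { output?: string; screenshotBase64?: string };
interface Executor {
  readonly displayWidth: number;
  readonly displayHeight: number;
  execute(action: Action): Promise<Result>;
}

interface AuditEntry {
  at: string; // ISO timestamp of the attempt
  action: Action;
  ok: boolean;
  error?: string;
}

// Wraps an executor so every action is logged and the run has a wall-clock deadline.
function withAudit(
  inner: Executor,
  log: (entry: AuditEntry) => void,
  deadlineMs: number,
): Executor {
  const startedAt = Date.now();
  return {
    displayWidth: inner.displayWidth,
    displayHeight: inner.displayHeight,
    async execute(action) {
      const at = new Date().toISOString();
      if (Date.now() - startedAt > deadlineMs) {
        log({ at, action, ok: false, error: "deadline exceeded" });
        throw new Error("deadline exceeded");
      }
      try {
        const result = await inner.execute(action);
        log({ at, action, ok: true });
        return result;
      } catch (err) {
        log({ at, action, ok: false, error: String(err) });
        throw err;
      }
    },
  };
}
```

In practice `log` would append to durable storage outside the sandbox, so the record survives even if the container is destroyed.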

Example: minimal Linux executor sketch

import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { readFile } from "node:fs/promises";
import type { ComputerExecutor } from "@agentium/core";

const exec = promisify(execFile);

const executor: ComputerExecutor = {
  displayWidth: 1280,
  displayHeight: 800,
  displayNumber: 99, // Xvfb :99
  execute: async (action) => {
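    // Extend the parent environment; replacing it outright would drop PATH and friends.
    const env = { ...process.env, DISPLAY: ":99" };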
    switch (action.action) {
      case "screenshot":
        break; // just snapshot below
      case "left_click":
        if (action.coordinate) await exec("xdotool", ["mousemove", String(action.coordinate[0]), String(action.coordinate[1])], { env });
        await exec("xdotool", ["click", "1"], { env });
        break;
      case "type":
        await exec("xdotool", ["type", "--delay", "10", action.text], { env });
        break;
      case "key":
        await exec("xdotool", ["key", action.text], { env });
        break;
      // ... etc
    }
    await exec("import", ["-display", ":99", "-window", "root", "/tmp/shot.png"]);
    const data = await readFile("/tmp/shot.png");
    return { screenshotBase64: data.toString("base64") };
  },
};

Comparison with `@agentium/browser`

| | `ComputerUseAgent` | `@agentium/browser` |
|---|---|---|
| Target | Any desktop (browser, Slack, IDE, …) | Web browser only |
| Underlying tool | Anthropic Computer Use | Vision-driven Playwright |
| Model required | Claude family | Any vision-capable model |
| Action space | Mouse + keyboard + zoom | DOM-aware + screenshot |
| Best for | Native apps, full OS automation | Web scraping, web testing |
