Browser Agents

Agentium supports autonomous browser automation through the BrowserAgent class in @agentium/browser. The agent uses a vision-capable LLM (GPT-4o, Gemini) to interpret screenshots of a browser and decide what actions to take — clicking, typing, scrolling, navigating — until the task is complete.

Browser agents use Playwright under the hood. After installing the package, run npx playwright install chromium to download the browser binary.

Installation

npm install @agentium/browser playwright
npx playwright install chromium

Quick Start

import { BrowserAgent } from "@agentium/browser";
import { openai } from "@agentium/core";

const browser = new BrowserAgent({
  name: "web-navigator",
  model: openai("gpt-4o"),
  startUrl: "https://www.google.com",
  maxSteps: 20,
});

const result = await browser.run(
  "Search for 'TypeScript agent framework' and tell me the first 3 results"
);

console.log(result.success); // true
console.log(result.result);  // "1. LangChain.js — ..."
console.log(result.steps.length); // number of actions taken

How It Works

Launch browser

Playwright opens a Chromium browser (headless by default) and navigates to the start URL.

Take screenshot

A PNG screenshot of the viewport is captured at CSS-pixel resolution (same dimensions as the configured viewport). By default a hit-tested accessibility tree is also extracted and sent alongside it (useDOM: true).

Send to vision model

The screenshot, the DOM tree (if enabled), and the task description are sent to the vision-capable LLM.

Receive action

The model returns a structured JSON action: click at coordinates, type text, scroll, navigate, etc.

Execute action

The action is executed via Playwright’s browser API.

Repeat or finish

Steps 2-5 repeat until the model returns “done” (task complete) or “fail” (task impossible), or the max step limit is reached.
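
The loop described above can be sketched in a few lines. This is a minimal illustration, not Agentium's implementation: the Action, Observation, and LoopResult types and the runLoop function are hypothetical names, the screenshot is a placeholder string, and the model and browser are passed in as plain callbacks.

```typescript
// Hypothetical shapes for illustration; not Agentium's real types.
type Action =
  | { action: "click"; x: number; y: number }
  | { action: "done"; result: string }
  | { action: "fail"; reason: string };

interface Observation { step: number; screenshot: string }
interface LoopResult { success: boolean; result: string; steps: Action[] }

async function runLoop(
  observe: (step: number) => Promise<Observation>,
  decide: (task: string, obs: Observation) => Promise<Action>,
  execute: (action: Action) => Promise<void>,
  task: string,
  maxSteps: number,
): Promise<LoopResult> {
  const steps: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const obs = await observe(step);        // take screenshot
    const action = await decide(task, obs); // send to model, receive action
    steps.push(action);
    if (action.action === "done") return { success: true, result: action.result, steps };
    if (action.action === "fail") return { success: false, result: action.reason, steps };
    await execute(action);                  // execute action, then repeat
  }
  // The step budget ran out before the model returned a terminal action.
  return { success: false, result: `Reached max steps (${maxSteps})`, steps };
}
```

Terminal actions end the loop without being executed, and running out of steps is reported as a failure rather than thrown.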

BrowserAgentConfig

const agent = new BrowserAgent(config: BrowserAgentConfig);

name

string

required

Name of the browser agent.

model

ModelProvider

required

Vision-capable model. Must support image inputs (e.g., openai("gpt-4o"), google("gemini-2.5-flash")).

instructions

string

Extra instructions appended to the system prompt. Use for task-specific guidance.

maxSteps

number

default:"30"

Maximum number of vision loop iterations before the agent gives up.

headless

boolean

default:"true"

Run browser without a visible window. Set to false for debugging and demos.

viewport

{ width: number; height: number }

default:"1280x720"

Browser viewport size in pixels. The model sees screenshots at this resolution.

startUrl

string

Initial URL to navigate to before starting the task.

waitAfterAction

number

default:"1500"

Milliseconds to wait after each action for the page to settle.

maxRepeats

number

default:"3"

Max consecutive identical actions before the agent auto-fails (loop detection).

useDOM

boolean

default:"true"

Include a simplified DOM/accessibility tree alongside the screenshot. Each interactive element is tagged with its exact center coordinate and hit-tested so the listed point is guaranteed to land on the labeled element (not on an overlay). This dramatically improves click accuracy and is on by default. Set to false only if you want a pure-vision flow or need to save tokens on very simple pages.

storageState

string

Path to a Playwright storageState JSON file. Restores cookies, localStorage, and sessionStorage from a previous session. Use this to maintain login state across runs.

recordVideo

boolean | { dir: string }

default:"false"

Enable video recording of the browser session. Pass true for the default directory (./browser-videos) or { dir: "/path" } for a custom location.

stealth

boolean | StealthConfig

default:"false"

Enable anti-bot-detection mode. Patches navigator.webdriver, spoofs plugins, languages, WebGL renderer, and more. Pass true for sensible defaults or a StealthConfig object for fine control (custom user-agent, locale, timezone, geolocation, proxy, deviceScaleFactor). deviceScaleFactor defaults to 1 — set to 2 only if your host display is actually Retina and you want sharper screenshots.

humanize

boolean | HumanizeConfig

default:"false"

Simulate human-like behavior — variable typing speed, jittered click coordinates, Bézier mouse movement curves, random micro-pauses. Pass true for defaults or a HumanizeConfig for fine control.

credentials

CredentialVault

Secure credential store. The LLM only sees named placeholders — real values are injected at execution time and scrubbed from all output.

costTracker

CostTracker

Track vision model token usage and enforce budgets across browser runs. Each vision loop step (screenshot → LLM → action) records its token usage. The same tracker can be shared with text and voice agents for unified cost monitoring.

pageExtractionLLM

ModelProvider

Secondary (usually cheaper) model used for the extract action and any text-only sub-task. Pair model: openai("gpt-4o") with pageExtractionLLM: openai("gpt-4o-mini") to dramatically cut cost on scraping-heavy workloads. Falls back to model if not set.

maxActionsPerStep

number

default:"3"

Maximum number of actions the model can return in a single step. Setting to 1 reverts to the v2.0.x “one action per step” behavior. Higher values speed up form filling significantly.

useVision

boolean | "auto"

default:"auto"

Vision mode. true always sends a screenshot; false never does (DOM-only operation); "auto" (the default) sends one on the first step, whenever useDOM is false, and on the step after the model uses the screenshot action. The "auto" mode saves a significant number of tokens on text-heavy workloads.

initialActions

BrowserAction[]

Actions to run before the LLM loop starts — saves vision tokens on boilerplate (cookie banners, login flow, scrolling to section). Same schema as actions the model emits.

maxFailures

number

default:"3"

Retry budget for transient failures (invalid JSON from the model, action exceptions, locator timeouts). Separate from maxRepeats which detects the model emitting the same action over and over.

directlyOpenUrl

boolean

default:"true"

If the task string contains a URL and no explicit startUrl is set, the agent navigates there before the first LLM call.

extendSystemMessage

string

Additional instructions appended to the default system prompt. Alias for instructions — both are concatenated.

overrideSystemMessage

string

Completely replace the default system prompt. The credentials and response-format sections are still appended automatically so the runtime contract holds. Most users should prefer instructions / extendSystemMessage instead.

allowEvaluate

boolean

default:"false"

Allow the evaluate action to run arbitrary JavaScript in the page. Off by default for safety — only enable when you trust the source of task strings.

allowedDomains

string[]

Restrict navigation to specific domains. Wildcard patterns supported: "example.com", "*.example.com", "http*://example.com", "*". Any navigate action to a non-matching URL throws.

prohibitedDomains

string[]

Block navigation to specific domains. Same pattern format as allowedDomains. When both are set, a URL must be in allowedDomains AND not in prohibitedDomains.

cdpUrl

string

Connect to an existing browser via Chrome DevTools Protocol instead of launching one ("http://localhost:9222"). When set, headless, stealth.args, recordVideo, etc. are ignored — the existing browser’s configuration is used. Closing the agent detaches without killing the remote browser.

tools

ToolDef[]

Custom tools the BrowserAgent itself can invoke during a run. The agent emits { "action": "tool", "name": "<tool>", "args": {...} } and the runtime dispatches to the tool’s execute(args, ctx). Use for 2FA codes, API calls, file I/O — anything the browser can’t do alone.

logLevel

string

default:"silent"

Logging level: "debug", "info", "warn", "error", "silent".

run()

const result = await agent.run(task: string, opts?: BrowserRunOpts);

task

string

required

Natural language description of what the agent should do in the browser.

opts.startUrl

string

Override the config’s startUrl for this run.

opts.apiKey

string

Per-run API key override for the vision model.

opts.saveStorageState

string

Path to save cookies/auth state after the run completes. Load it back on the next run via storageState in config.

BrowserRunOutput

Field	Type	Description
`result`	`string`	Final text result or failure reason
`success`	`boolean`	Whether the task completed successfully
`steps`	`BrowserStep[]`	Full action history with screenshots
`finalUrl`	`string`	URL at completion
`finalScreenshot`	`Buffer`	Last screenshot (PNG)
`durationMs`	`number`	Total time taken
`videoPath`	`string?`	Video file path (if `recordVideo` was enabled)

Available Actions

The model can choose from these actions at each step. Most actions accept either an index (preferred — resolves against the DOM snapshot we showed the model) or x/y coordinates (fallback).

Element-targeted

Action	Parameters	Description
`click`	`index` \| `x`/`y` + `description`	Click an element. Resolution order: `index` → quoted phrase in `description` → coordinates.
`type`	`index` \| `x`/`y`, `text`, `clear?`, `submit?`	Focus an input and type. Defaults to clearing the field; set `submit: true` to press Enter after typing.
`scroll`	`direction`, `amount?` \| `index`	Scroll the page, or scroll an indexed element into view.
`dropdown_options`	`index`	Read the option list of a native `<select>`.
`select_dropdown`	`index`, `text`	Pick an option in a native `<select>` by visible text/value.
`upload_file`	`index`, `path`	Set a file on an `<input type="file">`.
`find_text`	`text`	Scroll the first occurrence of a phrase into view.

Navigation & I/O

Action	Parameters	Description
`navigate`	`url`	Go to a specific URL. Validated against `allowedDomains`/`prohibitedDomains` if set.
`back`	—	Go back to the previous page.
`wait`	`ms`	Wait for the page to settle (capped at 10s).
`send_keys`	`keys`	Press arbitrary keys / combos / sequences. `"Tab Tab Enter"`, `"Control+l"`, `"Escape"`.
`screenshot`	—	Request a fresh screenshot on the next step (relevant when `useVision: "auto"`).
`evaluate`	`code`	Run arbitrary JS in the page context. Disabled by default; set `allowEvaluate: true` to enable.
`extract`	`query`, `extractLinks?`	Extract information from the current page via `pageExtractionLLM`. Result is fed back to the model on the next step and accumulated in `BrowserRunOutput.extractedContent`.
`tool`	`name`, `args?`	Invoke a custom `ToolDef` registered on the agent. Use for 2FA codes, API calls, file I/O, etc.

Terminal

Action	Parameters	Description
`done`	`result`	Task complete — return the result.
`fail`	`reason`	Task cannot be completed.

Batched actions

The model may return either a single action object or an array of up to maxActionsPerStep (default 3) action objects to execute in order. The runtime stops a batch early on navigation or substantial DOM changes. Batching makes form filling much faster, because several fields can be filled in one step:

[
  { "action": "type", "index": 5, "text": "Alice" },
  { "action": "type", "index": 6, "text": "Smith" },
  { "action": "type", "index": 7, "text": "alice@example.com" },
  { "action": "click", "index": 12, "description": "the 'Submit' button" }
]

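That’s one LLM round-trip instead of four. Note that a four-action batch like this one requires maxActionsPerStep to be set to 4 or higher; the default is 3.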
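
The batching behavior can be sketched as follows. This is an illustration under stated assumptions: BatchAction, ExecOutcome, and runBatch are hypothetical names, the executor reports page changes through a pageChanged flag, and an oversized batch is truncated to maxActionsPerStep (the real runtime may reject it instead).

```typescript
// Hypothetical types for illustration only.
interface BatchAction { action: string; [key: string]: unknown }
interface ExecOutcome { pageChanged: boolean } // navigation or a large DOM change

async function runBatch(
  actions: BatchAction[],
  maxActionsPerStep: number,
  execute: (a: BatchAction) => Promise<ExecOutcome>,
): Promise<BatchAction[]> {
  const executed: BatchAction[] = [];
  // Assumption: extra actions beyond the limit are dropped, not rejected.
  for (const action of actions.slice(0, maxActionsPerStep)) {
    const outcome = await execute(action);
    executed.push(action);
    // Element indexes from the old snapshot may be stale after a page change.
    if (outcome.pageChanged) break;
  }
  return executed;
}
```

Stopping early matters because the remaining actions were planned against a DOM snapshot that no longer matches the page.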

Coordinate System & Click Accuracy

BrowserAgent aligns four things so that model-decided clicks land on the intended element:

Screenshots are taken in CSS pixels (scale: "css") regardless of the context’s deviceScaleFactor. The image dimensions are always identical to viewport.width × viewport.height — no Retina/DPR mismatch where the model thinks the page is 1280×720 but the image is actually 2560×1440.
page.mouse.click(x, y) uses the same coordinate space as the screenshot — CSS pixels, top-left origin.
Out-of-range coordinates are clamped to the viewport as a safety net, so a slightly off coordinate still hits a reasonable spot instead of silently no-op’ing.
Text-based click fallback (preferred path). When the model’s click description contains a quoted target label (e.g. "Click on 'Cheapest' tab"), the agent first tries a deterministic Playwright text= locator click — DOM-based, substring-matched, pixel-independent. Coordinates are only used when no usable keyword is present or the locator times out. Generic labels like “OK”, “Close”, “Log in” are intentionally skipped to avoid trivially-matching the wrong element. The system prompt instructs the model to put the visible label in quotes so this path can take over.

Window sizing is left to Playwright’s viewport setting. Passing --window-size to Chromium fights with viewport (the flag sets the outer window size, including browser chrome) and can cause the page to be zoomed out, so it is intentionally not set.

deviceScaleFactor defaults to 1 in stealth mode. Forcing 2 on a host display that’s actually DPR=1 (most Windows/Linux setups and many external Mac monitors) causes the headed window to look zoomed out or stretched, because the OS compositor downsamples the 2× rendering surface. You can opt in via stealth: { deviceScaleFactor: 2 } if you want sharper screenshots and the host is actually Retina.
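
The click resolution order (index, then quoted label, then clamped coordinates) can be sketched like this. The names ClickAction, ClickTarget, resolveClick, and the GENERIC_LABELS list are hypothetical; the real skip list and the exact clamping bounds are not documented here, so treat them as assumptions.

```typescript
interface Viewport { width: number; height: number }
interface ClickAction { index?: number; x?: number; y?: number; description?: string }

type ClickTarget =
  | { kind: "index"; index: number }
  | { kind: "text"; label: string }
  | { kind: "point"; x: number; y: number };

// Assumed list of labels considered too generic for a text locator.
const GENERIC_LABELS = new Set(["ok", "close", "log in"]);

function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

// Returns the first quoted phrase in the description, unless it is generic.
function quotedLabel(description: string): string | undefined {
  const match = description.match(/'([^']+)'|"([^"]+)"/);
  const label = (match?.[1] ?? match?.[2])?.trim();
  if (!label || GENERIC_LABELS.has(label.toLowerCase())) return undefined;
  return label;
}

function resolveClick(action: ClickAction, viewport: Viewport): ClickTarget | undefined {
  if (action.index !== undefined) return { kind: "index", index: action.index };
  const label = action.description ? quotedLabel(action.description) : undefined;
  if (label) return { kind: "text", label };
  if (action.x === undefined || action.y === undefined) return undefined;
  // Coordinates are CSS pixels with a top-left origin, clamped into the viewport.
  return {
    kind: "point",
    x: clamp(Math.round(action.x), 0, viewport.width - 1),
    y: clamp(Math.round(action.y), 0, viewport.height - 1),
  };
}
```

In a real run the "text" branch would still fall back to coordinates if the locator timed out; this sketch only shows which path is tried first.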

DOM Extraction (Hybrid Mode)

By default (useDOM: true), the agent runs in hybrid mode: it sends both the screenshot and a simplified, hit-tested accessibility tree on every step. Each entry in the tree carries a stable per-step index (preferred handle for actions) and its center coordinate (fallback). The format is [idx] [cx,cy] role(type): "label":

[1] [640,300] input(text): "Search..."
[2] [960,45] button: "Sign In"
[3] [120,680] a: "Contact Us"

The model is instructed to use the index when acting (e.g. {"action": "click", "index": 2, "description": "the 'Sign In' button"}). Indexed actions resolve via Playwright’s locator API and are immune to layout shifts, devicePixelRatio drift, and the “sibling 4 pixels away” class of failures that plague raw coordinate clicks. Three properties of this extraction matter for accuracy:

Hit-tested: every listed coordinate is verified via document.elementFromPoint(cx, cy) to actually reach the labeled element. If an overlay, modal, or sticky banner covers the element’s center, the entry is dropped instead of misleading the model into a click that would land on the overlay.
Visibility-filtered: invisible elements (display: none, visibility: hidden, pointer-events: none, opacity near zero, zero-size boxes, off-screen) are excluded.
cursor: pointer fallback pass: in addition to the standard semantic selectors (button, a[href], [role='button'], etc.), the extractor also walks for any element whose computed cursor style is pointer. Modern React/Tailwind apps (FreightOS, Linear, Notion, …) frequently wrap tabs, chips, and cards in plain <div> / <span> with no semantic role, href, or onclick attribute — the only visual signal that they’re interactive is the pointer cursor. Without this pass those elements would be invisible to the agent.

The system prompt instructs the model to prefer DOM-listed coordinates over its own visual estimation whenever a matching entry exists, which makes clicks effectively element-id–accurate without giving up the visual context.
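
To make the entry format concrete, here is a small formatter and parser for the [idx] [cx,cy] role(type): "label" lines shown above. DomEntry, formatEntry, and parseEntry are hypothetical helper names written for this illustration; they are not part of the package.

```typescript
interface DomEntry {
  index: number;
  cx: number;
  cy: number;
  role: string;
  type?: string;
  label: string;
}

// Renders one entry in the documented line format.
function formatEntry(e: DomEntry): string {
  const role = e.type ? `${e.role}(${e.type})` : e.role;
  return `[${e.index}] [${e.cx},${e.cy}] ${role}: "${e.label}"`;
}

// Parses a line back into its fields; returns undefined if it does not match.
function parseEntry(line: string): DomEntry | undefined {
  const m = line.match(/^\[(\d+)\] \[(\d+),(\d+)\] ([^\s(:]+)(?:\(([^)]*)\))?: "(.*)"$/);
  if (!m) return undefined;
  const entry: DomEntry = {
    index: Number(m[1]),
    cx: Number(m[2]),
    cy: Number(m[3]),
    role: m[4],
    label: m[6],
  };
  if (m[5] !== undefined) entry.type = m[5];
  return entry;
}
```

A parser like this can be handy in tests or logs when you want to check which elements the model was shown on a given step.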

Disabling DOM mode

For very simple pages — or to save tokens — you can opt out:

const agent = new BrowserAgent({
  name: "pure-vision",
  model: openai("gpt-4o"),
  useDOM: false, // pure vision
});

Using `extractDOM` directly

You can also call extractDOM() on a BrowserProvider for ad-hoc element discovery. It returns both the human-readable string (what the model sees) and the structured list (useful for your own routing logic):

import { BrowserProvider } from "@agentium/browser";

const browser = new BrowserProvider();
await browser.launch();
await browser.navigate("https://example.com");

const { text, elements } = await browser.extractDOM({ maxElements: 50 });
console.log(text);

// elements: { index, cx, cy, role, type?, label, tag, isInput, isSelect, isFile }[]
const signIn = elements.find((e) => e.label.toLowerCase().includes("sign in"));
if (signIn) await browser.clickByIndex(signIn.index);

The default maxElements is 120. Elements are sorted top-to-bottom and left-to-right for stable ordering.

Cookie & Auth Persistence

Maintain login sessions across agent runs using Playwright’s storage state.

const agent = new BrowserAgent({
  name: "auth-agent",
  model: openai("gpt-4o"),
  storageState: "./auth-state.json", // load saved cookies
});

// First run: log in and save the state
const result = await agent.run("Log in with test@example.com", {
  saveStorageState: "./auth-state.json",
});

// Second run: starts already logged in
const result2 = await agent.run("Go to dashboard and get stats");

The storage state file includes cookies, localStorage, and sessionStorage — everything needed to resume an authenticated session.

Stealth Mode (Anti-Detection)

Many websites detect and block headless browsers. Stealth mode patches common detection vectors so the browser appears as a normal user session.

const agent = new BrowserAgent({
  name: "stealth-agent",
  model: openai("gpt-4o"),
  stealth: true,  // sensible defaults
  humanize: true,  // human-like behavior
});

What stealth patches

Vector	What it does
`navigator.webdriver`	Removed (normally `true` in automation)
`navigator.plugins`	Spoofed with realistic Chrome plugins
`navigator.languages`	Set to `["en-US", "en"]`
`navigator.permissions`	Notifications return `"prompt"` instead of `"denied"`
`window.chrome.runtime`	Stubbed to appear like a real Chrome extension API
WebGL renderer	Reports “Intel Iris OpenGL Engine” instead of “SwiftShader”
DOM markers	Removes `cdc_` and `__playwright` attributes
Chrome launch flags	`--disable-blink-features=AutomationControlled`
User-Agent	Rotated from a pool of realistic Chrome/Safari strings

Fine-grained StealthConfig

const agent = new BrowserAgent({
  name: "stealth-agent",
  model: openai("gpt-4o"),
  stealth: {
    userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    locale: "de-DE",
    timezone: "Europe/Berlin",
    geolocation: { latitude: 52.52, longitude: 13.405 },
    deviceScaleFactor: 1, // default. Set to 2 only on Retina hosts.
    proxy: {
      server: "http://proxy.example.com:8080",
      username: "user",
      password: "pass",
    },
  },
});

HumanizeConfig

Makes the browser behave like a real person — variable timing, imprecise clicks, curved mouse paths.

const agent = new BrowserAgent({
  name: "human-agent",
  model: openai("gpt-4o"),
  humanize: {
    typingDelay: [50, 150],   // ms per character (random in range)
    clickJitter: 4,           // ±4px random offset on clicks
    actionDelay: [300, 1000], // random pause between actions
    mouseMovement: true,      // Bézier curve mouse movement
  },
});

Option	Default	Description
`typingDelay`	`[40, 120]`	Min/max ms delay between keystrokes
`clickJitter`	`3`	Random pixel offset added to click coordinates
`actionDelay`	`[200, 800]`	Random pause after each interaction
`mouseMovement`	`true`	Simulate curved (Bézier) mouse movement to the target
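
The options above boil down to a few small randomized helpers. The sketch below is an assumption about how such behavior could be produced, not the library's code: randomInRange, jitterPoint, and bezierPath are hypothetical names, and a quadratic Bézier curve stands in for whatever curve the runtime actually uses.

```typescript
interface Point { x: number; y: number }
type Rng = () => number; // returns a value in [0, 1), like Math.random

// Picks a delay (or any value) uniformly inside [min, max).
function randomInRange(range: [number, number], rng: Rng = Math.random): number {
  const [min, max] = range;
  return min + rng() * (max - min);
}

// Offsets a click target by up to `jitter` pixels on each axis.
function jitterPoint(p: Point, jitter: number, rng: Rng = Math.random): Point {
  const offset = () => (rng() * 2 - 1) * jitter;
  return { x: p.x + offset(), y: p.y + offset() };
}

// Samples a quadratic Bézier path from `from` to `to`, bent through `control`.
function bezierPath(from: Point, control: Point, to: Point, segments: number): Point[] {
  const points: Point[] = [];
  for (let i = 0; i <= segments; i++) {
    const t = i / segments;
    const u = 1 - t;
    points.push({
      x: u * u * from.x + 2 * u * t * control.x + t * t * to.x,
      y: u * u * from.y + 2 * u * t * control.y + t * t * to.y,
    });
  }
  return points;
}
```

Passing the random source as a parameter keeps the helpers deterministic in tests while defaulting to Math.random in normal use.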

Video Recording

Record the agent’s entire browser session as a video for debugging, auditing, or demos.

const agent = new BrowserAgent({
  name: "recorded-agent",
  model: openai("gpt-4o"),
  recordVideo: true, // saves to ./browser-videos/
  // or: recordVideo: { dir: "./my-recordings" }
});

const result = await agent.run("Navigate to example.com and take notes");

if (result.videoPath) {
  console.log("Video saved at:", result.videoPath);
}

Playwright generates one video file per browser page. The path is returned in result.videoPath when the run completes.

Parallel Browsing (Multi-Tab)

BrowserProvider supports multiple tabs for advanced workflows:

import { BrowserProvider } from "@agentium/browser";

const browser = new BrowserProvider();
await browser.launch();

// Navigate first tab
await browser.navigate("https://site-a.com");

// Open a second tab
const tab2 = await browser.newTab("https://site-b.com");

// Switch between tabs
await browser.switchTab(tab2);
const screenshot2 = await browser.screenshot();

await browser.switchTab("tab-0"); // back to first tab
const screenshot1 = await browser.screenshot();

// List all open tabs
const tabs = browser.listTabs();
// [{ id: "tab-0", url: "https://site-a.com", active: true },
//  { id: "tab-1", url: "https://site-b.com", active: false }]

// Close a tab
await browser.closeTab(tab2);

await browser.close();

Tab API

Method	Returns	Description
`newTab(url?)`	`string`	Open a new tab, optionally navigate
`switchTab(tabId)`	`void`	Make a tab active
`closeTab(tabId)`	`void`	Close a tab (can’t close the last one)
`listTabs()`	`TabInfo[]`	List all open tabs with URL and active status
`currentTabId`	`string`	Get the active tab’s ID

Browser Gateway (Socket.IO)

Stream browser agent execution over Socket.IO for live observation UIs, dashboards, or remote monitoring.

import express from "express";
import { createServer } from "http";
import { Server } from "socket.io";
import { BrowserAgent } from "@agentium/browser";
import { createBrowserGateway } from "@agentium/transport";
import { openai } from "@agentium/core";

const app = express();
const server = createServer(app);
const io = new Server(server, { cors: { origin: "*" } });

const browserAgent = new BrowserAgent({
  name: "web-scraper",
  model: openai("gpt-4o"),
  headless: true,
  logLevel: "info",
});

createBrowserGateway({
  agents: { scraper: browserAgent },
  io,
  // namespace: "/agentium-browser",     // default
  // streamScreenshots: true,           // default
});

server.listen(3002, () => console.log("Browser gateway on :3002"));

Client Usage

import { io } from "socket.io-client";

const socket = io("http://localhost:3002/agentium-browser");

// Start a browser task
socket.emit("browser.start", {
  agentName: "scraper",
  task: "Go to Hacker News and list the top 5 stories",
  startUrl: "https://news.ycombinator.com",
});

// Live screenshots (base64 PNG)
socket.on("browser.screenshot", ({ data, mimeType }) => {
  // getElementById can return null, so check before assigning
  const img = document.getElementById("live-view") as HTMLImageElement | null;
  if (img) img.src = `data:${mimeType};base64,${data}`;
});

// Each action decided by the model
socket.on("browser.action", ({ action }) => {
  console.log("Agent decided:", action);
});

// Step-by-step progress
socket.on("browser.step", ({ index, action, pageUrl }) => {
  console.log(`Step ${index}: ${action.action} at ${pageUrl}`);
});

// Task complete
socket.on("browser.done", ({ result, success, durationMs, totalSteps }) => {
  console.log(success ? "Done!" : "Failed", result);
  console.log(`Took ${totalSteps} steps in ${durationMs}ms`);
});

// Cancel a running task
socket.emit("browser.stop");

Gateway Events

Direction	Event	Payload
Client → Server	`browser.start`	`{ agentName, task, startUrl?, apiKey? }`
Client → Server	`browser.stop`	—
Server → Client	`browser.started`	`{ agentName, task }`
Server → Client	`browser.screenshot`	`{ data: base64, mimeType }`
Server → Client	`browser.action`	`{ action }`
Server → Client	`browser.step`	`{ index, action, pageUrl, screenshot? }`
Server → Client	`browser.done`	`{ result, success, finalUrl, durationMs, totalSteps, videoPath? }`
Server → Client	`browser.error`	`{ error: string }`
Server → Client	`browser.stopped`	—
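
For a typed client, the server-to-client rows of this table can be written as a discriminated union. This is a sketch only: ServerEvent and summarize are hypothetical names, action is left as unknown, and field types the table does not state (for example result as a string) are assumptions.

```typescript
// Payload shapes transcribed from the events table; unstated types are assumed.
type ServerEvent =
  | { event: "browser.started"; payload: { agentName: string; task: string } }
  | { event: "browser.screenshot"; payload: { data: string; mimeType: string } }
  | { event: "browser.action"; payload: { action: unknown } }
  | { event: "browser.step"; payload: { index: number; action: unknown; pageUrl: string; screenshot?: string } }
  | {
      event: "browser.done";
      payload: {
        result: string;
        success: boolean;
        finalUrl: string;
        durationMs: number;
        totalSteps: number;
        videoPath?: string;
      };
    }
  | { event: "browser.error"; payload: { error: string } }
  | { event: "browser.stopped" };

// One log line per event; the switch is exhaustive over the union.
function summarize(e: ServerEvent): string {
  switch (e.event) {
    case "browser.started":
      return `started ${e.payload.agentName}: ${e.payload.task}`;
    case "browser.screenshot":
      return `screenshot (${e.payload.mimeType}, ${e.payload.data.length} base64 chars)`;
    case "browser.action":
      return `action ${JSON.stringify(e.payload.action)}`;
    case "browser.step":
      return `step ${e.payload.index} at ${e.payload.pageUrl}`;
    case "browser.done":
      return `${e.payload.success ? "done" : "failed"} in ${e.payload.totalSteps} steps (${e.payload.durationMs}ms)`;
    case "browser.error":
      return `error: ${e.payload.error}`;
    case "browser.stopped":
      return "stopped";
  }
}
```

Because the union is keyed by event name, the compiler flags any handler that reads a field the event does not carry.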

BrowserGatewayOptions

agents

Record<string, BrowserAgent>

required

Named BrowserAgent instances. Clients pick one via agentName.

io

Server

required

Socket.IO server instance.

namespace

string

default:"/agentium-browser"

Socket.IO namespace for the gateway.

streamScreenshots

boolean

default:"true"

Stream live screenshots to clients. Disable for bandwidth-constrained connections.

authMiddleware

(socket, next) => void

Optional authentication middleware applied to the namespace.

Security

URL Validation

The BrowserProvider validates URLs before navigation. Only http:// and https:// schemes are allowed — file://, javascript:, and data: URLs are rejected to prevent local file access and code injection.
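
A scheme allow-list of this kind can be expressed with the standard URL class. This is a minimal sketch of the idea, with isNavigableUrl as a hypothetical helper name; the provider's actual check may differ.

```typescript
const ALLOWED_PROTOCOLS = new Set(["http:", "https:"]);

// Accepts only absolute http(s) URLs; everything else is rejected.
function isNavigableUrl(raw: string): boolean {
  try {
    return ALLOWED_PROTOCOLS.has(new URL(raw).protocol);
  } catch {
    return false; // not a parseable absolute URL
  }
}
```

Checking the parsed protocol, rather than a string prefix, avoids being fooled by leading whitespace or mixed-case schemes.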

TLS Defaults

Stealth mode defaults ignoreHTTPSErrors to false, so TLS certificate errors are not silently bypassed unless you explicitly set ignoreHTTPSErrors: true in the StealthConfig. This helps protect production deployments against man-in-the-middle attacks.

Memory Safety

Background memory operations (memoryManager.afterRun) include .catch() handlers to prevent unhandled promise rejections from crashing the process.

Loop Detection

The agent detects when it’s stuck repeating the same action:

const agent = new BrowserAgent({
  name: "safe-agent",
  model: openai("gpt-4o"),
  maxRepeats: 3, // auto-fail after 3 identical consecutive actions
});

When the agent repeats the same action more than maxRepeats times, it stops and returns success: false with a descriptive error. This prevents infinite loops caused by popups, consent banners, or ambiguous page states.
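
Loop detection of this kind amounts to counting identical actions at the end of the history. The sketch below uses hypothetical names (trailingRepeats, isStuck), compares actions by their JSON form, and assumes the agent fails once the count reaches maxRepeats; the exact threshold comparison in the runtime may differ by one.

```typescript
// Counts how many times the most recent action repeats consecutively.
function trailingRepeats(history: unknown[]): number {
  if (history.length === 0) return 0;
  // JSON comparison is key-order sensitive, which is fine for a sketch.
  const last = JSON.stringify(history[history.length - 1]);
  let count = 0;
  for (let i = history.length - 1; i >= 0 && JSON.stringify(history[i]) === last; i--) {
    count++;
  }
  return count;
}

// Assumption: the run is considered stuck once the streak reaches maxRepeats.
function isStuck(history: unknown[], maxRepeats: number): boolean {
  return trailingRepeats(history) >= maxRepeats;
}
```

Only consecutive repeats count, so an agent that alternates between two different actions is not caught by this check.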

Cost Tracking

Browser agents make repeated vision model calls (one per step), which can accumulate significant cost. Use CostTracker to monitor and limit spending:

import { BrowserAgent } from "@agentium/browser";
import { openai, CostTracker } from "@agentium/core";

const tracker = new CostTracker({
  budget: { maxCostPerRun: 2.0 },  // $2 max per browser run
});

const agent = new BrowserAgent({
  name: "web-scraper",
  model: openai("gpt-4o"),
  costTracker: tracker,
  maxSteps: 30,
});

const result = await agent.run("Search for flights from NYC to London");

const summary = tracker.getSummary();
console.log(`Browser run cost: $${summary.totalCost.toFixed(4)}`);
console.log(`Total tokens: ${summary.totalTokens.totalTokens}`);
console.log(`Steps taken: ${result.steps.length}`);

Each step’s vision model call is tracked individually, so you get per-step token granularity in the cost entries.

asTool() — Browser as an Agent Tool

A powerful pattern is to give a regular text agent the ability to browse the web.

import { Agent, openai } from "@agentium/core";
import { BrowserAgent } from "@agentium/browser";

const browser = new BrowserAgent({
  name: "browser",
  model: openai("gpt-4o"),
  headless: true,
});

const agent = new Agent({
  name: "research-assistant",
  model: openai("gpt-4o"),
  instructions: "You help with research. Use the browser tool to look things up.",
  tools: [browser.asTool()],
});

const result = await agent.run(
  "Go to Hacker News and summarize the top 5 stories"
);

The text agent decides when to use the browser and what task to give it. The BrowserAgent handles all the visual navigation autonomously and returns a text result.

browser.asTool({
  name: "browse_web",        // tool name (default)
  description: "...",        // custom description
});

Events

Browser agents emit events via EventBus:

Event	Payload	When
`browser.screenshot`	`{ data: Buffer }`	Screenshot captured
`browser.action`	`{ action }`	Action decided by model
`browser.step`	`{ index, action, pageUrl, screenshot }`	Each loop iteration
`browser.done`	`{ result, success, steps }`	Task completed
`browser.error`	`{ error: Error }`	Error occurred

browser.eventBus.on("browser.action", ({ action }) => {
  console.log("Action:", JSON.stringify(action));
});

browser.eventBus.on("browser.done", ({ result, success }) => {
  console.log(success ? "Completed" : "Failed", result);
});

Tips

Use headless: false

Set headless: false during development to watch the agent navigate in real time.

Keep useDOM on

Hybrid DOM mode is on by default. Only set useDOM: false if you have a reason to use pure-vision (e.g. token budget on very simple pages).

Be specific

Clear, specific task descriptions produce better results than vague ones.

Set a start URL

Always provide a startUrl when possible. Starting from a blank page wastes steps.

Record videos

Use recordVideo: true during development to replay agent sessions.

Persist auth

Use storageState + saveStorageState to avoid re-logging-in every run.

Go stealth

Use stealth: true + humanize: true to bypass bot detection on protected sites.

Secure credentials

Use CredentialVault so the LLM never sees passwords — only placeholders.

Recipes

Cheap extraction with a secondary model

import { BrowserAgent } from "@agentium/browser";
import { openai } from "@agentium/core";

const agent = new BrowserAgent({
  name: "scraper",
  model: openai("gpt-4o"),                 // vision + reasoning
  pageExtractionLLM: openai("gpt-4o-mini"), // text-only extraction
  useVision: "auto",
  initialActions: [
    { action: "navigate", url: "https://news.ycombinator.com" },
  ],
});

const result = await agent.run("Extract the top 10 story titles and point counts.");
console.log(result.extractedContent); // every `extract` action's result, in order

Connect to an existing browser via CDP

const agent = new BrowserAgent({
  name: "attached",
  model: openai("gpt-4o"),
  cdpUrl: "http://localhost:9222",
});

await agent.run("Summarize the page that's currently open.");
// The remote browser keeps running after the agent finishes.

Restrict navigation

const agent = new BrowserAgent({
  name: "sandboxed",
  model: openai("gpt-4o"),
  allowedDomains: ["*.example.com", "auth.example.org"],
  prohibitedDomains: ["*.tracking.example.com"],
});
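
The allow/deny rule can be illustrated with a small wildcard matcher. This is a sketch under assumptions, not the library's matcher: patternToRegExp, matchesDomain, and isNavigationAllowed are hypothetical names, patterns containing "://" are compared against scheme plus host while bare patterns are compared against the host only, and "*.example.com" is assumed not to match the bare "example.com".

```typescript
// Turns a wildcard pattern into an anchored, case-insensitive RegExp.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except *
    .replace(/\*/g, ".*");                 // * matches any run of characters
  return new RegExp(`^${escaped}$`, "i");
}

function matchesDomain(url: string, pattern: string): boolean {
  const { protocol, hostname } = new URL(url);
  const subject = pattern.includes("://") ? `${protocol}//${hostname}` : hostname;
  return patternToRegExp(pattern).test(subject);
}

// A URL passes if it is in the allow-list (when set) and not in the deny-list.
function isNavigationAllowed(url: string, allowed?: string[], prohibited?: string[]): boolean {
  const inAllowed = !allowed || allowed.some((p) => matchesDomain(url, p));
  const inProhibited = !!prohibited && prohibited.some((p) => matchesDomain(url, p));
  return inAllowed && !inProhibited;
}
```

With the configuration above, a tracking subdomain is blocked even though it also matches "*.example.com", because the deny-list is applied after the allow-list.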

Custom tools the BrowserAgent can call mid-loop

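import { BrowserAgent } from "@agentium/browser";
import { defineTool, openai } from "@agentium/core";
import { z } from "zod";

// `totp`, `SECRET`, and `vault` are assumed to be defined elsewhere in your app:
// a TOTP generator, its shared secret, and a CredentialVault instance.
const get2FA = defineTool({
  name: "get_2fa",
  description: "Fetch the current TOTP code for our test account.",
  parameters: z.object({}), // the tool takes no arguments
  execute: async () => totp.now(SECRET),
});

const agent = new BrowserAgent({
  name: "auth-agent",
  model: openai("gpt-4o"),
  credentials: vault, // the LLM only sees placeholders, never real values
  tools: [get2FA],
  instructions: "When a 2FA prompt appears, call the get_2fa tool. NEVER guess.",
});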
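
How a tool action might be routed to a registered tool can be sketched as below. SketchTool, ToolAction, and dispatchTool are hypothetical names; the real ToolDef also receives a context argument, which is omitted here, and returning failures as text for the model is an assumption about the error path.

```typescript
// Simplified stand-in for a tool definition (no ctx argument).
interface SketchTool {
  name: string;
  execute: (args: Record<string, unknown>) => Promise<string>;
}

interface ToolAction {
  action: "tool";
  name: string;
  args?: Record<string, unknown>;
}

// Looks up the named tool and runs it, turning failures into text results.
async function dispatchTool(tools: SketchTool[], call: ToolAction): Promise<string> {
  const tool = tools.find((t) => t.name === call.name);
  if (!tool) return `Unknown tool: ${call.name}`;
  try {
    return await tool.execute(call.args ?? {});
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return `Tool ${call.name} failed: ${message}`;
  }
}
```

Returning the error as a string lets the model see what went wrong on the next step instead of aborting the run.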

Pure-DOM mode (no screenshots)

const agent = new BrowserAgent({
  name: "dom-only",
  model: openai("gpt-4o-mini"),
  useVision: false,      // never send screenshots
  useDOM: true,          // (default)
  maxActionsPerStep: 5,
});
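
For contrast with useVision: false, the "auto" rule described under BrowserAgentConfig can be written as a single predicate. The function name shouldSendScreenshot and its parameters are hypothetical; the logic simply restates the documented conditions.

```typescript
type VisionMode = boolean | "auto";

// Decides whether a screenshot accompanies the prompt on this step.
function shouldSendScreenshot(
  mode: VisionMode,
  stepIndex: number,
  useDOM: boolean,
  previousAction?: string,
): boolean {
  if (mode !== "auto") return mode; // true: always, false: never
  // "auto": first step, pure-vision runs, or right after a screenshot request.
  return stepIndex === 0 || !useDOM || previousAction === "screenshot";
}
```

In "auto" mode with the DOM tree enabled, most steps after the first are text-only, which is where the token savings come from.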

Examples

Example	Description
`examples/browser/30-browser-agent.ts`	Standalone browser agent — Hacker News search
`examples/browser/31-browser-as-tool.ts`	Browser as a tool inside a research agent
`examples/browser/32-browser-gateway.ts`	Browser agent streamed over Socket.IO with a live viewer
`examples/browser/33-browser-auth.ts`	Login flow using `CredentialVault` (LLM never sees secrets)
