Browser Agents
Agentium supports autonomous browser automation through the BrowserAgent class in @agentium/browser. The agent uses a vision-capable LLM (e.g. GPT-4o, Gemini) to interpret screenshots of a browser and decide what actions to take (clicking, typing, scrolling, navigating) until the task is complete.
Browser agents use Playwright under the hood. After installing the package, run npx playwright install chromium to download the browser binary.
Installation
Quick Start
How It Works
Launch browser
Playwright opens a Chromium browser (headless by default) and navigates to the start URL.
Take screenshot
A PNG screenshot of the viewport is captured at CSS-pixel resolution
(same dimensions as the configured viewport). By default a hit-tested
accessibility tree is also extracted and sent alongside it (useDOM: true).
Send to vision model
The screenshot (and DOM tree if enabled) and task description are sent to a vision-capable LLM.
Receive action
The model returns a structured JSON action: click at coordinates, type text, scroll, navigate, etc.
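The steps above can be sketched as a simple control loop. This is an illustrative, self-contained sketch and not Agentium's implementation: the Action shape and the runLoop, decide, execute, and maxSteps names are assumptions made for this example, and a real loop would be asynchronous and drive Playwright.

```typescript
// Hypothetical action shape, loosely modeled on the actions listed on this page.
type Action =
  | { action: "click"; x: number; y: number }
  | { action: "type"; text: string }
  | { action: "done"; result: string }
  | { action: "fail"; reason: string };

interface LoopResult {
  success: boolean;
  result: string;
  steps: Action[];
}

// decide() stands in for "screenshot + task -> vision model -> JSON action".
// execute() stands in for performing the action in the browser.
function runLoop(
  decide: (step: number, history: Action[]) => Action,
  execute: (action: Action) => void,
  maxSteps: number,
): LoopResult {
  const steps: Action[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = decide(step, steps);
    steps.push(action);
    if (action.action === "done") {
      return { success: true, result: action.result, steps };
    }
    if (action.action === "fail") {
      return { success: false, result: action.reason, steps };
    }
    execute(action);
  }
  // The iteration budget ran out before the model reported done or fail.
  return { success: false, result: `Gave up after ${maxSteps} steps`, steps };
}
```

The terminal actions (done, fail) end the loop without touching the browser; every other action is executed and the loop continues until the step budget is exhausted.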
BrowserAgentConfig
Name of the browser agent.
Vision-capable model. Must support image inputs (e.g.,
openai("gpt-4o"), google("gemini-2.5-flash")).
Extra instructions appended to the system prompt. Use for task-specific guidance.
Maximum number of vision loop iterations before the agent gives up.
Run browser without a visible window. Set to false for debugging and demos.
Browser viewport size in pixels. The model sees screenshots at this resolution.
Initial URL to navigate to before starting the task.
Milliseconds to wait after each action for the page to settle.
Max consecutive identical actions before the agent auto-fails (loop detection).
Include a simplified DOM/accessibility tree alongside the screenshot. Each
interactive element is tagged with its exact center coordinate and
hit-tested so the listed point is verified to land on the labeled
element (not on an overlay). This substantially improves click accuracy and
is on by default. Set to false only if you want a pure-vision flow or
need to save tokens on very simple pages.
Path to a Playwright storageState JSON file. Restores cookies, localStorage, and sessionStorage from a previous session. Use this to maintain login state across runs.
Enable video recording of the browser session. Pass true for the default directory (./browser-videos) or { dir: "/path" } for a custom location.
Enable anti-bot-detection mode. Patches navigator.webdriver, spoofs plugins, languages, WebGL renderer, and more. Pass true for sensible defaults or a StealthConfig object for fine control (custom user-agent, locale, timezone, geolocation, proxy, deviceScaleFactor). deviceScaleFactor defaults to 1; set it to 2 only if your host display is actually Retina and you want sharper screenshots.
Simulate human-like behavior: variable typing speed, jittered click coordinates, Bézier mouse movement curves, random micro-pauses. Pass true for defaults or a HumanizeConfig for fine control.
Secure credential store. The LLM only sees named placeholders; real values are injected at execution time and scrubbed from all output.
Track vision model token usage and enforce budgets across browser runs. Each vision loop step (screenshot → LLM → action) records its token usage. The same tracker can be shared with text and voice agents for unified cost monitoring.
Secondary (usually cheaper) model used for the extract action and any
text-only sub-task. Pair model: openai("gpt-4o") with
pageExtractionLLM: openai("gpt-4o-mini") to cut cost on
scraping-heavy workloads. Falls back to model if not set.
Maximum number of actions the model can return in a single step. Setting
it to 1 reverts to the v2.0.x "one action per step" behavior. Higher values
speed up form filling.
Vision mode. true always sends a screenshot, false never does
(DOM-only operation), and "auto" (default) sends a screenshot on the first
step, when useDOM is false, and whenever the model just used the
screenshot action. This saves tokens on text-heavy workloads.
Actions to run before the LLM loop starts. This saves vision tokens on
boilerplate (cookie banners, login flow, scrolling to a section). Same
schema as the actions the model emits.
Retry budget for transient failures (invalid JSON from the model, action
exceptions, locator timeouts). Separate from maxRepeats, which detects
the model emitting the same action over and over.
If the task string contains a URL and no explicit startUrl is set, the
agent navigates there before the first LLM call.
Additional instructions appended to the default system prompt. Alias for
instructions; both are concatenated.
Completely replace the default system prompt. The credentials and
response-format sections are still appended automatically so the runtime
contract holds. Most users should prefer instructions /
extendSystemMessage instead.
Allow the evaluate action to run arbitrary JavaScript in the page. Off
by default for safety; only enable it when you trust the source of task
strings.
Restrict navigation to specific domains. Wildcard patterns supported:
"example.com", "*.example.com", "http*://example.com", "*".
Any navigate action to a non-matching URL throws.
Block navigation to specific domains. Same pattern format as
allowedDomains. When both are set, a URL must be in allowedDomains
AND not in prohibitedDomains.
Connect to an existing browser via Chrome DevTools Protocol instead of
launching one (e.g. "http://localhost:9222"). When set, headless,
stealth.args, recordVideo, etc. are ignored and the existing browser's
configuration is used. Closing the agent detaches without killing the
remote browser.
Custom tools the BrowserAgent itself can invoke during a run. The agent
emits { "action": "tool", "name": "<tool>", "args": {...} } and the
runtime dispatches to the tool's execute(args, ctx). Use for 2FA
codes, API calls, file I/O, or anything else the browser can't do alone.
Logging level: "debug", "info", "warn", "error", "silent".
run()
Natural language description of what the agent should do in the browser.
Override the config's startUrl for this run.
Per-run API key override for the vision model.
Path to save cookies/auth state after the run completes. Load it back on the next run via storageState in config.
BrowserRunOutput
| Field | Type | Description |
|---|---|---|
| result | string | Final text result or failure reason |
| success | boolean | Whether the task completed successfully |
| steps | BrowserStep[] | Full action history with screenshots |
| finalUrl | string | URL at completion |
| finalScreenshot | Buffer | Last screenshot (PNG) |
| durationMs | number | Total time taken |
| videoPath | string? | Video file path (if recordVideo was enabled) |
Available Actions
The model can choose from these actions at each step. Most actions accept either an index (preferred; it resolves against the DOM snapshot shown to
the model) or x/y coordinates (fallback).
Element-targeted
| Action | Parameters | Description |
|---|---|---|
| click | index or x/y, plus description | Click an element. Resolution order: index → quoted phrase in description → coordinates. |
| type | index or x/y, text, clear?, submit? | Focus an input and type. Defaults to clearing the field; set submit: true to press Enter after typing. |
| scroll | direction, amount? or index | Scroll the page, or scroll an indexed element into view. |
| dropdown_options | index | Read the option list of a native <select>. |
| select_dropdown | index, text | Pick an option in a native <select> by visible text/value. |
| upload_file | index, path | Set a file on an <input type="file">. |
| find_text | text | Scroll the first occurrence of a phrase into view. |
Navigation & I/O
| Action | Parameters | Description |
|---|---|---|
| navigate | url | Go to a specific URL. Validated against allowedDomains/prohibitedDomains if set. |
| back | (none) | Go back to the previous page. |
| wait | ms | Wait for the page to settle (capped at 10s). |
| send_keys | keys | Press arbitrary keys, combos, or sequences, e.g. "Tab Tab Enter", "Control+l", "Escape". |
| screenshot | (none) | Request a fresh screenshot on the next step (relevant when useVision: "auto"). |
| evaluate | code | Run arbitrary JS in the page context. Disabled by default; set allowEvaluate: true to enable. |
| extract | query, extractLinks? | Extract information from the current page via pageExtractionLLM. Result is fed back to the model on the next step and accumulated in BrowserRunOutput.extractedContent. |
| tool | name, args? | Invoke a custom ToolDef registered on the agent. Use for 2FA codes, API calls, file I/O, etc. |
Terminal
| Action | Parameters | Description |
|---|---|---|
| done | result | Task complete; return the result. |
| fail | reason | Task cannot be completed. |
Batched actions
The model may return either a single action object or an array of up to maxActionsPerStep (default 3) action objects to execute in order. The
runtime stops the batch early on navigation or substantial DOM changes.
Batching makes form filling considerably faster, since several fields can be filled from a single model response.
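As an illustration of batching, the sketch below runs a batch in order and stops early when a step reports navigation or a DOM change. Only the maxActionsPerStep limit and the early-stop rule come from the text above; the BatchAction and StepEffect types, the runBatch function, the way a page change is reported, and the choice to truncate (rather than reject) an over-long batch are assumptions made for this example.

```typescript
type BatchAction = { action: string; [key: string]: unknown };

// Hypothetical signal reported by whatever executes an action.
interface StepEffect {
  navigated: boolean;
  domChanged: boolean;
}

// Normalizes the model output (single object or array), truncates it to
// maxActionsPerStep, and stops after the first action that changes the page.
function runBatch(
  modelOutput: BatchAction | BatchAction[],
  execute: (a: BatchAction) => StepEffect,
  maxActionsPerStep = 3,
): BatchAction[] {
  const batch: BatchAction[] = Array.isArray(modelOutput) ? modelOutput : [modelOutput];
  const executed: BatchAction[] = [];
  for (const action of batch.slice(0, maxActionsPerStep)) {
    const effect = execute(action);
    executed.push(action);
    // Remaining actions were planned against the old page, so drop them.
    if (effect.navigated || effect.domChanged) break;
  }
  return executed;
}
```

A typical form-filling batch would be two type actions followed by a click on the submit button; the click triggers navigation, so anything queued after it is discarded and the model plans again from a fresh snapshot.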
Coordinate System & Click Accuracy
BrowserAgent aligns four things so that model-decided clicks land on the
intended element:
- Screenshots are taken in CSS pixels (scale: "css") regardless of the context's deviceScaleFactor. The image dimensions are always identical to viewport.width × viewport.height, so there is no Retina/DPR mismatch where the model thinks the page is 1280×720 but the image is actually 2560×1440.
- page.mouse.click(x, y) uses the same coordinate space as the screenshot: CSS pixels, top-left origin.
- Out-of-range coordinates are clamped to the viewport as a safety net, so a slightly off coordinate still hits a reasonable spot instead of silently doing nothing.
- Text-based click fallback (preferred path). When the model's click description contains a quoted target label (e.g. "Click on 'Cheapest' tab"), the agent first tries a deterministic Playwright text= locator click, which is DOM-based, substring-matched, and pixel-independent. Coordinates are only used when no usable keyword is present or the locator times out. Generic labels like "OK", "Close", "Log in" are intentionally skipped to avoid trivially matching the wrong element. The system prompt instructs the model to put the visible label in quotes so this path can take over.
The window size is controlled by the viewport setting. Passing
--window-size to Chromium conflicts with viewport (the flag sets the outer
window size including browser chrome) and can cause the page to be zoomed
out, so the flag is intentionally not set.
deviceScaleFactor defaults to 1 in stealth mode. Forcing 2 on a host
display that’s actually DPR=1 (most Windows/Linux setups and many external
Mac monitors) causes the headed window to look zoomed out / stretched
because the OS compositor downsamples the 2× rendering surface. You can
opt in via stealth: { deviceScaleFactor: 2 } if you want sharper
screenshots and the host is actually Retina.
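The clamping safety net mentioned in this section can be sketched as follows. This is a minimal illustration, not the library's code: the clampToViewport name, the rounding to whole pixels, and the choice of width - 1 / height - 1 as the upper bounds are assumptions.

```typescript
interface Viewport {
  width: number;
  height: number;
}

interface Point {
  x: number;
  y: number;
}

// Clamp a model-supplied point into the viewport, in CSS pixels with a
// top-left origin. Valid coordinates run from 0 to width - 1 and
// from 0 to height - 1.
function clampToViewport(p: Point, vp: Viewport): Point {
  const clamp = (v: number, max: number): number =>
    Math.min(Math.max(Math.round(v), 0), max);
  return { x: clamp(p.x, vp.width - 1), y: clamp(p.y, vp.height - 1) };
}
```

With a 1280×720 viewport, a point such as (1300, -5) is pulled back to the nearest edge pixel instead of being passed to the browser unchanged.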
DOM Extraction (Hybrid Mode)
By default (useDOM: true), the agent runs in hybrid mode: it sends both
the screenshot and a simplified, hit-tested accessibility tree on every step.
Each entry in the tree carries a stable per-step index (preferred handle
for actions) and its center coordinate (fallback). The format is
[idx] [cx,cy] role(type): "label".
The model should use the index when acting (e.g.
{"action": "click", "index": 2, "description": "the 'Sign In' button"}).
Indexed actions resolve via Playwright’s locator API and are immune to
layout shifts, devicePixelRatio drift, and the “sibling 4 pixels away”
class of failures that plague raw coordinate clicks.
Three properties of this extraction matter for accuracy:
- Hit-tested: every listed coordinate is verified via document.elementFromPoint(cx, cy) to actually reach the labeled element. If an overlay, modal, or sticky banner covers the element's center, the entry is dropped instead of misleading the model into a click that would land on the overlay.
- Visibility-filtered: invisible elements (display: none, visibility: hidden, pointer-events: none, opacity near zero, zero-size boxes, off-screen) are excluded.
- cursor: pointer fallback pass: in addition to the standard semantic selectors (button, a[href], [role='button'], etc.), the extractor also walks for any element whose computed cursor style is pointer. Modern React/Tailwind apps (FreightOS, Linear, Notion, …) frequently wrap tabs, chips, and cards in plain <div>/<span> with no semantic role, href, or onclick attribute, so the only visual signal that they are interactive is the pointer cursor. Without this pass those elements would be invisible to the agent.
Disabling DOM mode
For very simple pages, or to save tokens, you can opt out by setting useDOM: false.
Using extractDOM directly
You can also call extractDOM() on a BrowserProvider for ad-hoc element
discovery. It returns both the human-readable string (what the model sees) and
the structured list (useful for your own routing logic).
maxElements is 120, and elements are sorted top-to-bottom and left-to-right for
stable ordering.
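To make the ordering and line format concrete, here is a small sketch that sorts entries top-to-bottom, then left-to-right, caps the list, and renders the [idx] [cx,cy] role(type): "label" lines. It is not the output of extractDOM itself: the DomEntry shape, the formatDomSnapshot name, zero-based indices, and the rendering of elements without a type are all assumptions.

```typescript
// Hypothetical structured entry; the real element shape may differ.
interface DomEntry {
  role: string;
  type?: string;
  label: string;
  cx: number;
  cy: number;
}

// Sort by vertical position first, then horizontal, keep at most
// maxElements entries, and render one line per element.
function formatDomSnapshot(entries: DomEntry[], maxElements = 120): string {
  return [...entries]
    .sort((a, b) => a.cy - b.cy || a.cx - b.cx)
    .slice(0, maxElements)
    .map((e, idx) => {
      const kind = e.type ? `${e.role}(${e.type})` : e.role;
      return `[${idx}] [${e.cx},${e.cy}] ${kind}: "${e.label}"`;
    })
    .join("\n");
}
```

Sorting before assigning indices keeps the numbering stable between runs on the same page, which is what makes an index a usable handle for the model.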
Cookie & Auth Persistence
Maintain login sessions across agent runs using Playwright's storage state.
Stealth Mode (Anti-Detection)
Many websites detect and block headless browsers. Stealth mode patches common detection vectors so the browser appears as a normal user session.
What stealth patches
| Vector | What it does |
|---|---|
| navigator.webdriver | Removed (normally true in automation) |
| navigator.plugins | Spoofed with realistic Chrome plugins |
| navigator.languages | Set to ["en-US", "en"] |
| navigator.permissions | Notifications return "prompt" instead of "denied" |
| window.chrome.runtime | Stubbed to appear like a real Chrome extension API |
| WebGL renderer | Reports “Intel Iris OpenGL Engine” instead of “SwiftShader” |
| DOM markers | Removes cdc_ and __playwright attributes |
| Chrome launch flags | --disable-blink-features=AutomationControlled |
| User-Agent | Rotated from a pool of realistic Chrome/Safari strings |
Fine-grained StealthConfig
HumanizeConfig
Makes the browser behave like a real person: variable timing, imprecise clicks, curved mouse paths.
| Option | Default | Description |
|---|---|---|
| typingDelay | [40, 120] | Min/max ms delay between keystrokes |
| clickJitter | 3 | Random pixel offset added to click coordinates |
| actionDelay | [200, 800] | Random pause after each interaction |
| mouseMovement | true | Simulate smoothstep mouse curves to target |
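The [min, max] delay ranges and the click jitter in the table can be illustrated with two small helpers. This is a sketch under assumptions, not the library's humanization code: randomDelay, jitterPoint, the uniform distribution, and the injectable rng parameter are all hypothetical.

```typescript
type DelayRange = [number, number];

// Pick a delay between min and max milliseconds. rng defaults to
// Math.random and can be swapped for a deterministic function in tests.
function randomDelay([min, max]: DelayRange, rng: () => number = Math.random): number {
  return Math.round(min + rng() * (max - min));
}

// Offset a click target by up to `jitter` pixels on each axis.
function jitterPoint(
  x: number,
  y: number,
  jitter: number,
  rng: () => number = Math.random,
): { x: number; y: number } {
  const offset = (): number => Math.round((rng() * 2 - 1) * jitter);
  return { x: x + offset(), y: y + offset() };
}
```

For example, randomDelay([40, 120]) yields a per-keystroke pause within the typingDelay default, and jitterPoint(640, 360, 3) moves a click by at most 3 pixels in each direction.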
Video Recording
Record the agent's entire browser session as a video for debugging, auditing, or demos.
The video file path is available as result.videoPath when the run completes.
Parallel Browsing (Multi-Tab)
BrowserProvider supports multiple tabs for advanced workflows:
Tab API
| Method | Returns | Description |
|---|---|---|
| newTab(url?) | string | Open a new tab, optionally navigate |
| switchTab(tabId) | void | Make a tab active |
| closeTab(tabId) | void | Close a tab (can't close the last one) |
| listTabs() | TabInfo[] | List all open tabs with URL and active status |
| currentTabId | string | Get the active tab's ID |
Browser Gateway (Socket.IO)
Stream browser agent execution over Socket.IO for live observation UIs, dashboards, or remote monitoring.
Client Usage
Gateway Events
| Direction | Event | Payload |
|---|---|---|
| Client → Server | browser.start | { agentName, task, startUrl?, apiKey? } |
| Client → Server | browser.stop | — |
| Server → Client | browser.started | { agentName, task } |
| Server → Client | browser.screenshot | { data: base64, mimeType } |
| Server → Client | browser.action | { action } |
| Server → Client | browser.step | { index, action, pageUrl, screenshot? } |
| Server → Client | browser.done | { result, success, finalUrl, durationMs, totalSteps, videoPath? } |
| Server → Client | browser.error | { error: string } |
| Server → Client | browser.stopped | — |
BrowserGatewayOptions
Named BrowserAgent instances. Clients pick one via agentName.
Socket.IO server instance.
Socket.IO namespace for the gateway.
Stream live screenshots to clients. Disable for bandwidth-constrained connections.
Optional authentication middleware applied to the namespace.
Security
URL Validation
The BrowserProvider validates URLs before navigation. Only http:// and https:// schemes are allowed; file://, javascript:, and data: URLs are rejected to prevent local file access and code injection.
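A scheme allowlist of this kind can be sketched with the standard URL class. The isNavigableUrl name is hypothetical and this is not the BrowserProvider's actual check; it only illustrates the rule stated above.

```typescript
// Accept only http: and https: URLs. Everything else is rejected,
// including strings that do not parse as absolute URLs.
function isNavigableUrl(raw: string): boolean {
  try {
    const { protocol } = new URL(raw);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}
```

Parsing with URL rather than matching on string prefixes means the scheme is normalized first, so mixed-case input such as "HTTPS://example.com" is handled the same way as the lowercase form.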
TLS Defaults
Stealth mode defaults ignoreHTTPSErrors to false. This means TLS certificate errors are not silently bypassed unless you explicitly configure ignoreHTTPSErrors: true in the StealthConfig. This helps protect production deployments against man-in-the-middle attacks.
Memory Safety
Background memory operations (memoryManager.afterRun) include .catch() handlers to prevent unhandled promise rejections from crashing the process.
Loop Detection
The agent detects when it is stuck repeating the same action. If the same action is emitted maxRepeats times in a row, it stops and returns success: false with a descriptive error. This prevents infinite loops caused by popups, consent banners, or ambiguous page states.
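One way to detect consecutive identical actions is to compare the tail of the action history. The sketch below is an assumption about how such a check could work, not the agent's actual logic: the isStuck name and the use of JSON serialization for equality are hypothetical, and the real comparison may ignore some fields.

```typescript
// Returns true when the last `maxRepeats` actions are all identical.
// Actions are compared by their JSON serialization, which assumes the
// action objects are built with a consistent key order.
function isStuck(history: object[], maxRepeats: number): boolean {
  if (maxRepeats < 1 || history.length < maxRepeats) return false;
  const tail = history.slice(-maxRepeats).map((a) => JSON.stringify(a));
  return tail.every((s) => s === tail[0]);
}
```

Only the most recent actions are inspected, so an agent that clicked the same button earlier but has since moved on is not flagged.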
Cost Tracking
Browser agents make repeated vision model calls (one per step), which can accumulate significant cost. Use CostTracker to monitor and limit spending.
asTool() — Browser as an Agent Tool
The most powerful pattern: give a regular text agent the ability to browse the web.
Events
Browser agents emit events via EventBus:
| Event | Payload | When |
|---|---|---|
| browser.screenshot | { data: Buffer } | Screenshot captured |
| browser.action | { action } | Action decided by model |
| browser.step | { index, action, pageUrl, screenshot } | Each loop iteration |
| browser.done | { result, success, steps } | Task completed |
| browser.error | { error: Error } | Error occurred |
Tips
Use headless: false
Set headless: false during development to watch the agent navigate in real time.
Keep useDOM on
Hybrid DOM mode is on by default. Only set useDOM: false if you have a reason to use pure-vision (e.g. token budget on very simple pages).
Be specific
Clear, specific task descriptions produce better results than vague ones.
Set a start URL
Provide a startUrl whenever possible. Starting from a blank page wastes steps.
Record videos
Use recordVideo: true during development to replay agent sessions.
Persist auth
Use storageState + saveStorageState to avoid logging in again on every run.
Go stealth
Use stealth: true + humanize: true to get past bot detection on protected sites.
Secure credentials
Use CredentialVault so the LLM never sees passwords, only placeholders.
Recipes
Cheap extraction with a secondary model
Connect to an existing browser via CDP
Restrict navigation
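To show how the allowedDomains and prohibitedDomains rule ("in allowedDomains AND not in prohibitedDomains") could be evaluated, here is a self-contained sketch. The matching semantics are assumptions, not documented behavior: patterns containing "://" are compared against scheme plus hostname, other patterns against the hostname only, "*" matches any run of characters, and "*.example.com" does not match the bare "example.com". The globToRegExp, matchesDomain, and isNavigationAllowed names are hypothetical.

```typescript
// Convert a wildcard pattern to an anchored, case-insensitive RegExp.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
    .replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`, "i");
}

// Patterns containing "://" are compared against "scheme://host";
// all other patterns are compared against the hostname only.
function matchesDomain(url: URL, pattern: string): boolean {
  const subject = pattern.includes("://")
    ? `${url.protocol}//${url.hostname}`
    : url.hostname;
  return globToRegExp(pattern).test(subject);
}

// Throws if rawUrl is not a parsable absolute URL.
function isNavigationAllowed(
  rawUrl: string,
  allowedDomains?: string[],
  prohibitedDomains?: string[],
): boolean {
  const url = new URL(rawUrl);
  const allowed = !allowedDomains || allowedDomains.some((p) => matchesDomain(url, p));
  const blocked = !!prohibitedDomains && prohibitedDomains.some((p) => matchesDomain(url, p));
  return allowed && !blocked;
}
```

For instance, isNavigationAllowed("https://shop.example.com/cart", ["*.example.com"], ["ads.example.com"]) passes the allowlist and is not on the blocklist, while the same call with "https://evil.test" fails the allowlist.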
Custom tools the BrowserAgent can call mid-loop
Pure-DOM mode (no screenshots)
Examples
| Example | Description |
|---|---|
| examples/browser/30-browser-agent.ts | Standalone browser agent: Hacker News search |
| examples/browser/31-browser-as-tool.ts | Browser as a tool inside a research agent |
| examples/browser/32-browser-gateway.ts | Browser agent streamed over Socket.IO with a live viewer |
| examples/browser/33-browser-auth.ts | Login flow using CredentialVault (LLM never sees secrets) |