Voice Agents

RadarOS supports real-time voice conversations through the VoiceAgent class. Voice agents connect to speech-to-speech APIs (OpenAI Realtime, Google Gemini Live) over WebSocket, handle audio streaming, tool calling, and persistent user memory — all with the same patterns as regular text agents.
Voice agents use a separate RealtimeProvider interface (not the regular ModelProvider). The realtime API manages its own conversation context within the WebSocket connection.

Quick Start

npm install @radaros/core ws
import { VoiceAgent, openaiRealtime, defineTool } from "@radaros/core";
import { z } from "zod";

const weatherTool = defineTool({
  name: "getWeather",
  description: "Get weather for a city",
  parameters: z.object({ city: z.string() }),
  execute: async ({ city }) => `${city}: 22°C, sunny`,
});

const agent = new VoiceAgent({
  name: "assistant",
  provider: openaiRealtime("gpt-4o-realtime-preview"),
  instructions: "You are a helpful voice assistant.",
  tools: [weatherTool],
  voice: "alloy",
});

const session = await agent.connect();

// Send audio from a microphone
session.sendAudio(pcmBuffer);

// Listen for responses
session.on("audio", ({ data }) => { /* play PCM audio */ });
session.on("transcript", ({ text, role }) => console.log(`[${role}] ${text}`));

// Clean up
await session.close();

Architecture

Voice agents have a layered architecture:
Browser/Client
    ↕ Socket.IO (audio + events)
Voice Gateway (@radaros/transport)
    ↕ events
VoiceAgent (@radaros/core)
    ↕ WebSocket
RealtimeProvider (OpenAI / Google)

VoiceAgent

Orchestrator. Manages the realtime connection, tools, user memory, and session lifecycle.

RealtimeProvider

WebSocket adapter for a specific speech-to-speech API. Translates between RadarOS events and the provider’s protocol.

Voice Gateway

Thin Socket.IO relay. Bridges browser audio to VoiceAgent. No business logic.

VoiceAgent Config

const agent = new VoiceAgent(config: VoiceAgentConfig);
name (string, required)
Name of the voice agent.

provider (RealtimeProvider, required)
The realtime provider to use. Use the shorthand helpers openaiRealtime() or googleLive(), or instantiate OpenAIRealtimeProvider / GoogleLiveProvider directly.

instructions (string)
System instructions for the voice agent. User memory facts are automatically appended on connect.

tools (ToolDef[])
Tools the agent can call during a voice conversation. Same defineTool() API as regular agents.

voice (string)
Voice to use for speech synthesis (e.g., "alloy", "shimmer", "echo"). Provider-specific.

userMemory (UserMemory)
Cross-session user memory. Facts are loaded into instructions on connect and auto-extracted from transcripts on disconnect.

model (ModelProvider)
LLM model used by UserMemory for auto-extracting facts from conversation transcripts. Required when userMemory is set.

userId (string)
Default user ID. Can be overridden per connect() call.

temperature (number)
Temperature for response generation.

turnDetection (TurnDetectionConfig | null)
Server-side voice activity detection config. Set to null to disable.

logLevel (string, default: "silent")
Logging level: "debug", "info", "warn", "error", "silent".
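Putting the optional fields together, a fuller config might look like the sketch below (the values shown are illustrative, not recommended defaults):

```typescript
import { VoiceAgent, openaiRealtime } from "@radaros/core";

const agent = new VoiceAgent({
  name: "support",
  provider: openaiRealtime("gpt-4o-realtime-preview"),
  instructions: "You are a concise support agent.",
  voice: "shimmer",
  userId: "default-user", // fallback when connect() gets no userId
  temperature: 0.7,
  turnDetection: null,    // disable server-side VAD; the client decides when a turn ends
  logLevel: "info",
});
```

With turnDetection set to null, the agent will not auto-respond when the user stops speaking, so the client must explicitly trigger responses (for example via sendText() or push-to-talk logic).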

connect()

Call connect() to start a voice session:
const session = await agent.connect({
  apiKey: "sk-...",   // optional per-session key override
  userId: "akash",    // identifies the user for memory
  sessionId: "s-123", // optional session identifier
});
On connect, the agent:
  1. Loads user facts from UserMemory (if configured) and appends them to instructions
  2. Opens a WebSocket to the realtime provider
  3. Sends session config (instructions, tools, voice, etc.)
  4. Returns a VoiceSession handle

VoiceSession

The session handle returned by connect():
Method                    Description
sendAudio(data: Buffer)   Send raw PCM audio to the agent
sendText(text: string)    Send a text message (triggers a spoken response)
interrupt()               Interrupt the current response
close()                   End the session. Triggers user memory extraction.
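A minimal sketch of these methods in use, reusing the Quick Start agent (micChunk is a placeholder for real microphone data):

```typescript
import { VoiceAgent, openaiRealtime } from "@radaros/core";

const agent = new VoiceAgent({
  name: "assistant",
  provider: openaiRealtime("gpt-4o-realtime-preview"),
  instructions: "You are a helpful voice assistant.",
});

const session = await agent.connect({ userId: "akash" });

// Text input still produces a spoken reply
session.sendText("What's the weather like today?");

// Barge-in: cut off the current response before sending fresh mic audio
session.interrupt();
const micChunk = Buffer.alloc(3200); // placeholder: ~100 ms of 16 kHz mono PCM16
session.sendAudio(micChunk);

// End the session; triggers user memory fact extraction if configured
await session.close();
```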

Events

Event              Payload                                         Description
audio              { data: Buffer, mimeType: string }              Audio response chunk (PCM16)
transcript         { text: string, role: "user" | "assistant" }    Speech-to-text transcript
text               { text: string }                                Text-only response delta
tool_call_start    { name: string, args: unknown }                 Tool call initiated
tool_result        { name: string, result: string }                Tool call completed
interrupted        {}                                              Response was interrupted
error              { error: Error }                                Error occurred
disconnected       {}                                              Session ended
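Wiring all of these on a session handle might look like the following sketch (speakerQueue is a hypothetical playback buffer, and session comes from agent.connect()):

```typescript
const speakerQueue: Buffer[] = []; // stand-in for a real audio playback queue

session.on("audio", ({ data, mimeType }) => {
  // data is a Buffer of PCM16 audio; queue it for playback
  speakerQueue.push(data);
});
session.on("transcript", ({ text, role }) => console.log(`[${role}] ${text}`));
session.on("text", ({ text }) => console.log(`(text) ${text}`));
session.on("tool_call_start", ({ name, args }) => console.log(`calling ${name}`, args));
session.on("tool_result", ({ name, result }) => console.log(`${name} -> ${result}`));
session.on("interrupted", () => { speakerQueue.length = 0; }); // drop queued audio on barge-in
session.on("error", ({ error }) => console.error("voice session error:", error));
session.on("disconnected", () => console.log("session ended"));
```

Clearing the playback queue on interrupted matters in practice: otherwise already-buffered audio keeps playing after the user barges in.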

Realtime Providers

OpenAI Realtime

import { openaiRealtime } from "@radaros/core";

const provider = openaiRealtime("gpt-4o-realtime-preview", {
  apiKey: "sk-...",   // optional, defaults to OPENAI_API_KEY env
  baseURL: "wss://...", // optional custom endpoint
});
Requires: npm install ws

Google Gemini Live

import { googleLive } from "@radaros/core";

const provider = googleLive("gemini-2.0-flash-live-001", {
  apiKey: "...",  // optional, defaults to GOOGLE_API_KEY env
});
Requires: npm install @google/genai
Both openaiRealtime() and googleLive() are shorthand helpers that return a RealtimeProvider. They mirror the openai() / google() pattern used for text models. The class exports (OpenAIRealtimeProvider, GoogleLiveProvider) are still available for advanced use.

User Memory in Voice

Voice agents support the same UserMemory as regular agents. The flow:
  1. User connects: connect({ userId: "akash" }) loads stored facts and appends them to the agent’s instructions.
  2. Conversation happens: the agent knows the user’s name, preferences, etc. from the injected facts.
  3. User disconnects: on close() or disconnect, all transcripts are consolidated (small deltas merged into full messages) and sent to the LLM for fact extraction.
  4. Facts are stored: new facts are deduplicated and saved. Next time the user connects, they’re automatically loaded.
import { VoiceAgent, openaiRealtime, openai, UserMemory, MongoDBStorage } from "@radaros/core";

const storage = new MongoDBStorage("mongodb://localhost:27017", "myapp", "voice_data");
const userMemory = new UserMemory({ storage, maxFacts: 200 });

const agent = new VoiceAgent({
  name: "assistant",
  provider: openaiRealtime("gpt-4o-realtime-preview"),
  userMemory,
  model: openai("gpt-4o-mini"), // for fact extraction
  instructions: "You are a helpful voice assistant.",
  voice: "alloy",
});

// User "akash" connects — their stored facts are loaded automatically
const session = await agent.connect({ userId: "akash" });
Voice agents do not use the Memory class (long-term summarization) or SessionManager. The realtime API manages its own conversation context within the WebSocket connection. Only UserMemory persists across sessions.

Tool Calling

Tools work the same as regular agents. When the realtime API detects a tool call intent:
  1. The provider emits a tool_call event
  2. VoiceAgent executes the tool via ToolExecutor
  3. The result is sent back to the provider
  4. The agent speaks the result
const trackShipment = defineTool({
  name: "trackShipment",
  description: "Track a shipment by tracking number",
  parameters: z.object({
    trackingNumber: z.string(),
  }),
  execute: async ({ trackingNumber }) => {
    const res = await fetch(`https://api.example.com/track?id=${trackingNumber}`);
    const data = await res.json();
    return `Status: ${data.status}, ETA: ${data.eta}`;
  },
});

const agent = new VoiceAgent({
  name: "logistics",
  provider: openaiRealtime("gpt-4o-realtime-preview"),
  tools: [trackShipment],
  instructions: "You help track shipments. Ask for the tracking number.",
});

Voice Gateway (Socket.IO)

For browser-based voice apps, use createVoiceGateway from @radaros/transport:
npm install @radaros/transport express socket.io
import express from "express";
import { createServer } from "http";
import { Server as SocketIOServer } from "socket.io";
import { VoiceAgent, openaiRealtime } from "@radaros/core";
import { createVoiceGateway } from "@radaros/transport";

const agent = new VoiceAgent({
  name: "assistant",
  provider: openaiRealtime("gpt-4o-realtime-preview"),
  instructions: "You are a voice assistant.",
  voice: "alloy",
});

const app = express();
const httpServer = createServer(app);
const io = new SocketIOServer(httpServer, { cors: { origin: "*" } });

createVoiceGateway({
  agents: { assistant: agent },
  io,
  namespace: "/voice",
});

httpServer.listen(3001);
The gateway is a thin relay — it forwards Socket.IO events to the VoiceAgent and streams audio/events back. All memory, session, and tool logic lives in the agent.

Client-Side Events

Event (emit)       Payload                            Description
voice.start        { agentName, userId?, apiKey? }    Start a voice session
voice.audio        { data: base64 }                   Send mic audio (PCM16, base64)
voice.text         { text: string }                   Send text input
voice.interrupt    (none)                             Interrupt the current response
voice.stop         (none)                             End the session

Event (listen)     Payload                            Description
voice.started      { userId }                         Session started
voice.audio        { data: base64, mimeType }         Audio response (PCM16, base64)
voice.transcript   { text, role }                     Transcript delta
voice.tool.call    { name, args }                     Tool call started
voice.tool.result  { name, result }                   Tool call result
voice.interrupted  (none)                             Response interrupted
voice.error        { error: string }                  Error
voice.stopped      (none)                             Session ended
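A browser client for the gateway can be sketched with socket.io-client as follows; the onMicChunk capture hook and playbackQueue are hypothetical stand-ins for real Web Audio plumbing:

```typescript
import { io } from "socket.io-client";

const socket = io("http://localhost:3001/voice");
const playbackQueue: Uint8Array[] = []; // stand-in for a Web Audio playback pipeline

socket.emit("voice.start", { agentName: "assistant", userId: "akash" });

socket.on("voice.started", ({ userId }) => console.log(`session up for ${userId}`));
socket.on("voice.audio", ({ data }) => {
  // decode base64 PCM16 and queue it for playback
  const bytes = Uint8Array.from(atob(data), (c) => c.charCodeAt(0));
  playbackQueue.push(bytes);
});
socket.on("voice.transcript", ({ text, role }) => console.log(`[${role}] ${text}`));
socket.on("voice.interrupted", () => { playbackQueue.length = 0; });
socket.on("voice.error", ({ error }) => console.error(error));

// Called by your mic-capture code with raw PCM16 chunks (assumed helper)
function onMicChunk(chunk: ArrayBuffer) {
  const b64 = btoa(String.fromCharCode(...new Uint8Array(chunk)));
  socket.emit("voice.audio", { data: b64 });
}
```

On teardown, emit voice.stop and wait for voice.stopped before discarding the socket, so the server can finish user memory extraction.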

Examples

Example                               Description
examples/voice/26-voice-openai.ts     OpenAI voice agent with mic/speaker
examples/voice/27-voice-google.ts     Google Gemini Live voice agent
examples/voice/29-voice-socketio.ts   Full browser voice app with Socket.IO, tools, and unified memory