Semantic Cache

Semantic caching stores LLM responses indexed by the semantic meaning of the input, rather than by the exact text. When a sufficiently similar query arrives, the cached response is returned without calling the LLM, reducing both cost and latency.

Quick Start

import { Agent, openai, InMemoryVectorStore, OpenAIEmbedding } from "@radaros/core";

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new InMemoryVectorStore(new OpenAIEmbedding()),
    embedding: new OpenAIEmbedding(),
    similarityThreshold: 0.92,
    scope: "agent",
  },
});

// First call: LLM call, result cached
await agent.run("What is the capital of France?");

// Second call: returns from cache (no LLM call)
await agent.run("What's the capital of France?");

Configuration

interface SemanticCacheConfig {
  vectorStore: VectorStore;        // Any vector store backend
  embedding: EmbeddingProvider;    // Embedding model for similarity
  similarityThreshold?: number;    // 0-1, default 0.92
  ttl?: number;                    // Cache expiry in ms
  collection?: string;             // Vector collection name
  scope?: "global" | "agent" | "session";
}
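For example, a per-session cache with a one-hour TTL and a stricter match threshold could be configured like this (the agent name and collection name are illustrative, not part of the API):

```typescript
import { Agent, openai, InMemoryVectorStore, OpenAIEmbedding } from "@radaros/core";

const agent = new Agent({
  name: "support-bot", // illustrative name
  model: openai("gpt-4o"),
  semanticCache: {
    vectorStore: new InMemoryVectorStore(new OpenAIEmbedding()),
    embedding: new OpenAIEmbedding(),
    similarityThreshold: 0.95,   // stricter matching than the 0.92 default
    ttl: 60 * 60 * 1000,         // entries expire one hour after being written
    collection: "support-cache", // illustrative collection name
    scope: "session",            // each session gets its own cache partition
  },
});
```

A higher threshold trades hit rate for precision: 0.95 will only reuse responses for near-identical paraphrases, while lower values risk returning a cached answer to a semantically different question.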

Scope

| Scope | Behavior |
| --- | --- |
| `global` | All agents share one cache |
| `agent` | Each agent has its own cache partition |
| `session` | Each session has its own cache partition |

How It Works

  1. Before calling the LLM, the input is embedded and searched against the vector store
  2. If a result exceeds the similarityThreshold, it’s returned as a cache hit
  3. Output guardrails still run on cached responses
  4. After an LLM call, the input + output are stored in the vector store (fire-and-forget)
  5. TTL is enforced on lookup — expired entries are evicted lazily
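The lookup in steps 1, 2, and 5 can be sketched as follows. This is an illustrative standalone implementation, not the library's internals: it assumes cosine similarity over pre-computed embedding vectors and lazy TTL checks on read.

```typescript
// Illustrative sketch of the cache lookup flow: compare a query embedding
// against stored entries and return a hit only above the threshold.

interface CacheEntry {
  vector: number[];   // embedding of the original input
  response: string;   // cached LLM output
  expiresAt?: number; // epoch ms; checked lazily on lookup
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function lookup(
  entries: CacheEntry[],
  queryVector: number[],
  threshold = 0.92,
  now = Date.now(),
): string | undefined {
  let best: { score: number; response: string } | undefined;
  for (const entry of entries) {
    // Step 5: expired entries are skipped (evicted lazily), never returned.
    if (entry.expiresAt !== undefined && entry.expiresAt < now) continue;
    const score = cosineSimilarity(entry.vector, queryVector);
    // Step 2: only a result at or above the threshold counts as a hit.
    if (score >= threshold && (best === undefined || score > best.score)) {
      best = { score, response: entry.response };
    }
  }
  return best?.response; // undefined means a cache miss
}
```

In the real library the query vector comes from the configured `embedding` provider and the nearest-neighbor search is delegated to the `vectorStore`; the threshold comparison and lazy TTL behavior are the parts this sketch illustrates.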

Events

| Event | Payload |
| --- | --- |
| `cache.hit` | `{ agentName, input, cachedId }` |
| `cache.miss` | `{ agentName, input }` |
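These events can be used to track the cache hit rate. The sketch below assumes an EventEmitter-style `agent.on(event, handler)` subscription method, which is an assumption and not documented above:

```typescript
// Assumes an EventEmitter-style `agent.on` API (not documented above).
let hits = 0;
let misses = 0;

agent.on("cache.hit", ({ agentName, input, cachedId }) => {
  hits++;
  console.log(`[${agentName}] cache hit ${cachedId} for: ${input}`);
});

agent.on("cache.miss", ({ agentName, input }) => {
  misses++;
  console.log(`[${agentName}] cache miss for: ${input}`);
});
```

A hit rate well below expectations usually means the `similarityThreshold` is too high for the variety of phrasings your users produce, or the scope is too narrow to accumulate entries.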

Supported Backends

Any VectorStore implementation works: InMemoryVectorStore, QdrantVectorStore, MongoDBVectorStore, PgVectorStore.
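Because the cache only depends on the `VectorStore` interface, swapping backends is a one-line change in the config. The sketch below assumes `QdrantVectorStore` accepts a connection-options object; its actual constructor signature is not documented here:

```typescript
import { Agent, openai, QdrantVectorStore, OpenAIEmbedding } from "@radaros/core";

const agent = new Agent({
  name: "assistant",
  model: openai("gpt-4o"),
  semanticCache: {
    // Constructor arguments are an assumption; consult the Qdrant backend docs.
    vectorStore: new QdrantVectorStore({ url: "http://localhost:6333" }),
    embedding: new OpenAIEmbedding(),
    similarityThreshold: 0.92,
  },
});
```

An external backend such as Qdrant or Postgres lets cached entries persist across process restarts and be shared between instances, which the in-memory store cannot do.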