Zero-Cost Intent Routing: Benchmarking window.ai & Edge Vector DBs in Agent Loops

When building browser-native AI agents, developers face two recurring bottlenecks: latency and cost. Round trips to cloud-hosted LLM providers for simple intent classification or local searches can add hundreds of milliseconds per call and accumulate into sizable API bills.

To close this gap, the modern web runtime allows us to run models and data retrieval engines entirely client-side. By combining Chrome's built-in Gemini Nano (window.ai) with WebAssembly (Wasm) powered vector indexes, we get intent classification with no per-query API cost and, in our tests, latencies in the 8-15 ms range, along with edge-based RAG.

The Built-in AI Advantage: window.ai

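window.ai is an experimental Chrome API, not yet a ratified web standard, that exposes the browser's built-in large language model (Gemini Nano in Google Chrome) directly to JavaScript. Its surface has changed between Chrome releases, so always feature-detect before calling it.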

Instead of routing every single agent heartbeat to a remote endpoint, we can delegate low-level classification tasks to the local engine:

TS.SNIPPET

// ⚡ Instantiating Chrome's Native Gemini Nano
async function classifyIntentLocal(userInput: string): Promise<string> {
  // window.ai is experimental and absent from the standard DOM typings
  const ai = (window as any).ai;
  if (!ai || !ai.assistant) {
    return 'FALLBACK_TO_CLOUD'; // Not supported or enabled
  }

  const session = await ai.assistant.create({
    systemPrompt: "You are an agent intent router. Classify the user query into one of: 'SEARCH_RECIPES', 'OPEN_WORKSHOP', 'VIEW_GARDEN', or 'UNKNOWN'."
  });

  try {
    const response: string = await session.prompt(userInput);
    return response.trim();
  } finally {
    session.destroy(); // Free up GPU/RAM memory even if prompt() throws
  }
}
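
A small on-device model will not always answer with a bare label; it may wrap the label in quotes, punctuation, or a full sentence. The sketch below shows one way to map the raw response back onto the four labels from the system prompt above. The `normalizeIntent` helper and its matching rules are assumptions for illustration, not part of the built-in API.

```typescript
// Labels copied from the system prompt above.
const INTENT_LABELS = ['SEARCH_RECIPES', 'OPEN_WORKSHOP', 'VIEW_GARDEN', 'UNKNOWN'] as const;
type Intent = (typeof INTENT_LABELS)[number];

// Hypothetical helper: reduce a free-form model reply to one known label.
function normalizeIntent(raw: string): Intent {
  // Uppercase and turn every run of non-alphanumeric characters into "_",
  // so "view garden" and "'VIEW_GARDEN'." compare the same way.
  const cleaned = raw.toUpperCase().replace(/[^A-Z0-9]+/g, '_');

  // Collect every concrete label that appears in the cleaned reply.
  const matches = INTENT_LABELS.filter(
    (label) => label !== 'UNKNOWN' && cleaned.includes(label)
  );

  // Accept only an unambiguous answer; anything else is treated as UNKNOWN.
  const [only, ...rest] = matches;
  return only !== undefined && rest.length === 0 ? only : 'UNKNOWN';
}
```

Treating ambiguous or unrecognized replies as UNKNOWN keeps the router conservative: a reply that cannot be trusted is escalated rather than acted on.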

Running Vector Databases on the Edge

For retrieval-augmented generation (RAG) to run client-side, we must store and query embeddings within the browser.
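
To make the query step concrete, the sketch below scores a query embedding against every stored embedding with cosine similarity and keeps the best matches. This exact, brute-force scan is a baseline written for illustration (the `cosineSimilarity` and `bruteForceSearch` names are assumptions); an approximate index aims to return nearly the same neighbors without visiting every vector.

```typescript
interface Hit {
  id: number;    // position of the vector in the stored array
  score: number; // cosine similarity to the query, in [-1, 1]
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error('Vector dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors match nothing
}

// Exact top-K search: score every stored vector, sort descending, keep K.
function bruteForceSearch(vectors: Float32Array[], query: Float32Array, topK: number): Hit[] {
  return vectors
    .map((vector, id) => ({ id, score: cosineSimilarity(vector, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

For small collections a scan like this may already be fast enough; the cost grows linearly with the number of stored vectors, which is the case an index is meant to address.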

Using DuckDB-Wasm or lightweight Wasm-compiled libraries that implement Hierarchical Navigable Small World (HNSW) graph indexes, we can maintain a fast, offline vector search engine. The database runs in a background Web Worker to avoid blocking the main rendering thread.

TS.SNIPPET

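// Inside vector-worker.ts
import { HNSW } from 'hnsw-wasm';

let index: HNSW | undefined; // Stays undefined until INIT_INDEX arrives

self.onmessage = async (e: MessageEvent) => {
  if (e.data.type === 'INIT_INDEX') {
    index = new HNSW(384); // 384-dimension vectors (e.g., MiniLM embeddings)
  } else if (e.data.type === 'SEARCH') {
    if (!index) {
      // Guard against a SEARCH that arrives before the index exists
      self.postMessage({ type: 'ERROR', message: 'Index not initialized' });
      return;
    }
    const results = index.search(e.data.queryVector, e.data.topK);
    self.postMessage({ type: 'RESULTS', results });
  }
};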
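
On the main thread, the worker's message protocol is easier to use behind a promise-based wrapper. The sketch below is one possible client, assuming only the INIT_INDEX, SEARCH, and RESULTS message types shown above; the `VectorSearchClient` class and `WorkerLike` interface are hypothetical names. Because the replies carry no request id, the client keeps a single search in flight at a time.

```typescript
// Minimal shape of a Worker that this sketch relies on.
interface WorkerLike {
  postMessage(message: unknown): void;
  onmessage: ((event: any) => void) | null;
}

// Hypothetical client: wraps the worker's messages in promises.
class VectorSearchClient {
  private readonly worker: WorkerLike;
  // Tail of the request chain; each search waits for the previous one.
  private queue: Promise<unknown> = Promise.resolve();

  constructor(worker: WorkerLike) {
    this.worker = worker;
    this.worker.postMessage({ type: 'INIT_INDEX' });
  }

  search(queryVector: Float32Array, topK: number): Promise<unknown> {
    const run = () =>
      new Promise<unknown>((resolve, reject) => {
        // Install the reply handler before sending the request.
        this.worker.onmessage = (event) => {
          this.worker.onmessage = null;
          if (event.data?.type === 'RESULTS') {
            resolve(event.data.results);
          } else {
            reject(new Error(`Unexpected worker reply: ${event.data?.type}`));
          }
        };
        this.worker.postMessage({ type: 'SEARCH', queryVector, topK });
      });

    // Start only after the previous request settles, whether it succeeded or failed.
    const result = this.queue.then(run, run);
    this.queue = result.catch(() => undefined);
    return result;
  }
}
```

In a browser, the client would be constructed with a real worker, for example `new VectorSearchClient(new Worker(new URL('./vector-worker.ts', import.meta.url), { type: 'module' }))`, though the exact worker URL syntax depends on the bundler in use.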

Latency and Performance Benchmarks

In our experiments running local intent routing and edge vector lookups on consumer laptops, the hybrid loop responded far faster than the cloud-only loop:

Metric	Cloud-Only Loop (Gemini Flash)	Hybrid Edge-First Loop (Nano + Wasm)	Improvement
Intent Routing Latency	~350ms - 600ms	8ms - 15ms	~40x Faster
Vector DB Search	~120ms (Cloud DB)	4ms (Local Wasm)	~30x Faster
API Token Cost	$0.00015 / query	$0.00000	100% Cost Savings
Offline Support	None (Fails)	Fully Functional	Reliability Gain
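
The Improvement column can be reproduced from the latency figures in the table. The sketch below compares two latency ranges three ways; the `speedupRange` helper is an illustrative assumption, and comparing range midpoints is only one of several reasonable conventions. Under that convention the routing row lands near the quoted ~40x, while the spread between worst and best case is considerably wider.

```typescript
interface LatencyRange {
  minMs: number;
  maxMs: number;
}

// Hypothetical helper: speedup of the edge path over the cloud path.
function speedupRange(cloud: LatencyRange, edge: LatencyRange) {
  const mid = (r: LatencyRange) => (r.minMs + r.maxMs) / 2;
  return {
    worst: cloud.minMs / edge.maxMs, // fastest cloud call vs. slowest edge call
    typical: mid(cloud) / mid(edge), // midpoint vs. midpoint
    best: cloud.maxMs / edge.minMs,  // slowest cloud call vs. fastest edge call
  };
}

// Intent routing: 350-600 ms (cloud) vs. 8-15 ms (edge), from the table above.
const routing = speedupRange({ minMs: 350, maxMs: 600 }, { minMs: 8, maxMs: 15 });
// worst is about 23x, typical is about 41x, best is 75x

// Vector search: ~120 ms (cloud) vs. 4 ms (edge); single values, so all three agree at 30x.
const vectorSearch = speedupRange({ minMs: 120, maxMs: 120 }, { minMs: 4, maxMs: 4 });

// Cloud routing cost at the table's $0.00015 per query.
const queriesPerMonth = 1_000_000; // assumed volume, for illustration only
const monthlyCloudCostUsd = 0.00015 * queriesPerMonth; // roughly $150
```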

Designing Hybrid Orchestration Loops

While window.ai handles classification and short structural summaries well, it lacks the depth of larger cloud models for complex reasoning.

A high-performance agent loop uses a tiered execution model:

Tier 1 (Edge): Local classification and routing via window.ai, with local context lookup from the Wasm database.
Tier 2 (Cloud): If local confidence falls below a threshold, or the request's complexity exceeds local limits, escalate the payload to cloud models (e.g., Gemini Pro).
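
The two tiers can be sketched as a small decision function plus an orchestrator. Everything here is an assumption made for illustration: the local classifier shown earlier returns only text, so an UNKNOWN or FALLBACK_TO_CLOUD label stands in for low confidence, a word count stands in for complexity, and the `chooseTier`, `handleTurn`, and `maxLocalWords` names and the default cutoff of 40 words are hypothetical.

```typescript
type Tier = 'EDGE' | 'CLOUD';

interface RoutingLimits {
  maxLocalWords: number; // assumed complexity cutoff, tune per application
}

// Hypothetical decision rule for the tiered loop.
function chooseTier(
  localLabel: string,
  userInput: string,
  limits: RoutingLimits = { maxLocalWords: 40 }
): Tier {
  // No local model available, or the local model could not place the query.
  if (localLabel === 'FALLBACK_TO_CLOUD' || localLabel === 'UNKNOWN') {
    return 'CLOUD';
  }
  // Crude complexity proxy: long inputs are escalated.
  const wordCount = userInput.trim().split(/\s+/).filter(Boolean).length;
  return wordCount > limits.maxLocalWords ? 'CLOUD' : 'EDGE';
}

// Orchestrator: try the edge first, escalate only when the rule says so.
async function handleTurn(
  userInput: string,
  classifyLocal: (input: string) => Promise<string>,
  askCloud: (input: string) => Promise<string>
): Promise<{ tier: Tier; result: string }> {
  const label = await classifyLocal(userInput);
  const tier = chooseTier(label, userInput);
  if (tier === 'EDGE') {
    return { tier, result: label };
  }
  return { tier, result: await askCloud(userInput) };
}
```

Passing the local classifier and the cloud call in as functions keeps the escalation rule separate from any particular model API, so the thresholds can be tuned without touching either tier.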

By shifting routine classification and state queries to the edge, we get agent loops that answer common requests within milliseconds, keep working offline, and incur no API or hosting cost for the requests handled locally.
