Agentic RAG for voice: the latency budget you didn't know you had.

There's a number every voice engineer learns to fear: 800ms. Past that, the caller starts to feel the pause. The agent stops sounding like a human and starts sounding like a kiosk.

When we first wired retrieval into Tone, our naive pipeline blew that budget on a single lookup. Here's what we changed.

The naive pipeline

The textbook RAG pattern looks like this on a voice turn:

Transcribe the caller's utterance.
Embed the transcript.
Search the vector store.
Stuff the top-k chunks into the LLM prompt.
Generate the response.
Synthesize speech.

Every step is sequential. Every step blocks the next.

What we actually want

The model already decides whether to call tools mid-response. So instead of wrapping retrieval around the model, we made retrieval a tool the model calls when it decides it needs to:

agent.registerTool({
  name: 'lookup_knowledge',
  description: 'Retrieve company policy and product docs.',
  schema: z.object({ query: z.string() }),
  handler: async ({ query }) => {
    return await knowledgeBase.search(query, { k: 3 });
  },
});

This sounds small. It isn't.

Why this saves 400ms

Two reasons. First, the model often doesn't need retrieval at all — for greetings, confirmations, and clarifications, retrieval is wasted work. The tool-call pattern skips it entirely on those turns.

Second, when retrieval is needed, the model has already started speaking by the time the tool call resolves. The LLM streams the preamble ("Let me check that for you…") while retrieval runs in parallel. By the time the answer matters, it's there.

What you give up

A lot of the prompt-engineering folklore around RAG ("always retrieve, always ground") doesn't survive contact with this pattern. The model has to be good enough to know when it needs help. In our evals, GPT-4.1 and Claude 3.5 cross that bar easily. Smaller models do not.

Pick your model accordingly.