Architecture

How Vomyra Works

Every Myra call runs the same four-step pipeline: transcribe the caller's speech, decide what to say, optionally call a tool, then speak the reply. The end-to-end target is under 700ms � fast enough to feel like a real conversation.

End-to-end latency target

< 700ms response latency

From end of caller speech to first audio byte

This is the time from when the caller stops speaking to when Myra begins replying. Human conversations feel natural at under 700ms. Vomyra targets this through streaming at every stage � transcription, LLM output, and TTS are all streamed in parallel rather than sequentially, so the pipeline starts the next stage as soon as the first tokens arrive.

STT

80�150ms

LLM

300�500ms

Tool (if used)

Variable

TTS

100�200ms

The pipeline, step by step

1. Listen � Speech to Text

Typical latency: ~80�150ms

Transcriber

When the caller speaks, their audio is streamed to the transcription provider in near real time. Vomyra uses Azure, Deepgram, or others depending on your configuration. The transcriber converts spoken words to text and handles background noise, accents, and code-switching between Hindi and English.

Azure is recommended for Hindi, Tamil, Telugu, and other Indian languages � it is tuned for Indian accents and outperforms generic models on noisy mobile calls.

Speech-to-Text Speech configuration Custom keywords

2. Understand � Language Model Inference

Typical latency: ~300�500ms

LLM

The transcript is appended to the conversation history and, together with the system prompt, sent to the language model (OpenAI, Meta, Vomyra, or a custom endpoint). The model decides what Myra should say next and whether to call a tool. Faster, lighter models reduce this step; more capable models handle complex, ambiguous conversations better.

The system prompt is Myra's job description. A specific, well-written prompt on a mid-tier model beats a vague prompt on the best model every time.

Language Models System Prompt Custom LLMs

3. Act � Tool Execution (optional)

Typical latency: Variable � depends on your endpoint

Tools

If the model decides a tool should run, Vomyra calls your endpoint (or a built-in integration) with the parameters Myra collected from the caller. The tool response is returned to the model, which incorporates it into the next reply. Tool calls are the main driver of end-to-end latency variance � fast endpoints keep conversations fluid.

Design tool endpoints specifically for voice: return only the fields Myra needs to speak, keep responses under 200ms, and always provide a rejection plan for failures.

Tools overview Custom tools Tool rejection plan

4. Speak � Text to Speech

Typical latency: ~100�200ms

TTS

The model's text reply is sent to the text-to-speech provider, which synthesises speech in real time and streams audio back to the caller. Voice selection and provider affect both quality and latency. Vomyra AI and Azure voices are optimised for Indian languages.

Keep replies short in the system prompt � one or two sentences per turn. Short TTS requests have lower latency and callers are less likely to interrupt.

Text-to-Speech Custom voices Voice pipeline

What you configure vs. what Vomyra manages

You configure

Language and voice
First message (opening line)
System prompt (Myra's job)
Tools and integrations
Response delay and denoising
Model and transcriber provider

Vomyra manages

Real-time audio streaming
Transcription pipeline routing
LLM inference and streaming
Turn detection and interruption
TTS streaming to caller
Infrastructure, scaling, uptime

Dive deeper

Speech-to-Text

Transcriber selection for Indian languages and noisy call environments.

Language Models

Compare providers and understand how the system prompt drives behaviour.

Text-to-Speech

Voice selection, quality, and latency for spoken responses.

Tools

How Myra acts on your systems mid-call without blocking the conversation.

End-to-end latency target

< 700ms response latency

From end of caller speech to first audio byte

STT

80�150ms

LLM

300�500ms

Tool (if used)

Variable

TTS

100�200ms

The pipeline, step by step

1. Listen � Speech to Text

Typical latency: ~80�150ms

Transcriber

Azure is recommended for Hindi, Tamil, Telugu, and other Indian languages � it is tuned for Indian accents and outperforms generic models on noisy mobile calls.

Speech-to-Text Speech configuration Custom keywords

2. Understand � Language Model Inference

Typical latency: ~300�500ms

LLM

The system prompt is Myra's job description. A specific, well-written prompt on a mid-tier model beats a vague prompt on the best model every time.

Language Models System Prompt Custom LLMs

3. Act � Tool Execution (optional)

Typical latency: Variable � depends on your endpoint

Tools

Design tool endpoints specifically for voice: return only the fields Myra needs to speak, keep responses under 200ms, and always provide a rejection plan for failures.

Tools overview Custom tools Tool rejection plan

4. Speak � Text to Speech

Typical latency: ~100�200ms

TTS

Keep replies short in the system prompt � one or two sentences per turn. Short TTS requests have lower latency and callers are less likely to interrupt.

Text-to-Speech Custom voices Voice pipeline

What you configure vs. what Vomyra manages

You configure

Language and voice
First message (opening line)
System prompt (Myra's job)
Tools and integrations
Response delay and denoising
Model and transcriber provider

Vomyra manages

Real-time audio streaming
Transcription pipeline routing
LLM inference and streaming
Turn detection and interruption
TTS streaming to caller
Infrastructure, scaling, uptime