Research · Voice of IndiaApril 2026 · 16 min read

Building Voice of India:A national benchmark for Indic ASR

What it actually takes to evaluate speech recognition for India — across 15 languages, 139 regional clusters, 36,691 speakers, and a scoring protocol that finally treats orthographic variation as variation rather than error.

Languages

139

Regional Clusters

36,691

Speakers

536h

Audio · 306k utterances

Why we built this

India does not speak the way ASR benchmarks assume. We interrupt ourselves. We switch languages mid-thought. We say "doctor" in the middle of a Hindi sentence and spell it in Devanagari one day, in Latin script the next. We pronounce numbers three different ways depending on register. And almost none of this looks like the clean, scripted, single-speaker English audio that most speech recognition systems are measured against.

For a long time, the Indic ASR field has had a quiet credibility problem. Models post single-digit Word Error Rates on academic test sets, then stumble the moment they meet real callers in real districts. The gap between leaderboard accuracy and production accuracy is not an engineering bug. It is a measurement bug. The benchmark itself is wrong.

Voice of India is our attempt to fix the measurement. It is a national, research-led benchmark covering 15 Indic languages and 1,200+ districts, built in partnership with AI4Bharat and IIT Madras, and scored using a new lattice-based protocol called OI-WER (Orthographically Informed Word Error Rate) that gives credit for linguistically valid variations instead of penalizing them.

This blog is the long version of how we did it: the data collection pipeline, the cluster-based sampling strategy, the six-stage transcription audit, the lattice protocol, and what the leaderboard actually tells us about the state of speech recognition for Indian languages today.

The problem with WER and CER for Indic languages

WER (Word Error Rate) measures the fraction of words in the reference transcript that the ASR system got wrong, computed as (substitutions + insertions + deletions) / total reference words. CER (Character Error Rate) applies the same formula at the character level instead of the word level, which makes it more forgiving of spelling and morphological variation — a useful trait for Indic scripts.

Word Error Rate was designed for English. It assumes every spoken word has one accepted spelling, that languages don't mix mid-sentence, and that the gap between formal and colloquial usage is narrow. Indian languages don't satisfy any of those assumptions. The result is that WER systematically misreads correct transcriptions as errors. Six failure modes show up over and over.

1. Code-mixing and the script-mismatch trap

A native Hindi speaker says "वह doctor के पास गया". The reference annotator kept doctor in Latin script. The ASR system transliterated it to डॉक्टर. Both are correct. Standard WER counts a substitution.

Hindi · code-mixed

Reference: वह doctor के पास गया
ASR output: वह डॉक्टर के पास गया
Standard WER: 20% — flags 'doctor' as a substitution
Linguistic reality: Identical meaning. Both spellings are correct.

2. Colloquial vs formal registers

Every Indian language has a formal written register and a colloquial spoken register, both equally understood by native speakers. WER treats any deviation from the reference as an error.

Hindi · colloquial vs formal

Reference (formal): वे एक साथ कार्य कर रहे हैं
ASR output (colloquial): वो एक साथ कार्य कर रहे हैँ
Standard WER: 33% — 2 of 6 words flagged as errors
Linguistic reality: A native Hindi speaker understands both transcriptions as the same — the meaning is identical.

3. Numeric format variability

The same number can be written three valid ways in an Indic language: as words (पांच सौ), as Arabic numerals (500), or in Devanagari numerals (५००). WER treats all three as unrelated tokens.

Hindi · numeric format variability

Reference: कुल पांच सौ छात्र उपस्थित थे
ASR output: कुल 500 छात्र उपस्थित थे
Standard WER: 33% — 2 of 6 words flagged as errors
Linguistic reality: A native Hindi speaker reads both as identical — the number five hundred is conveyed correctly.

4. Orthographic variation: the Nukta problem

Loanwords from Persian and Arabic are routinely written with or without a Nukta (ज़ vs ज) depending on regional convention and publisher style. Both are correct. WER doesn't know that.

Hindi · Nukta variation

Reference: उसने ज़रूर खाना खाया
ASR output: उसने जरूर खाना खाया
Standard WER: 25% — 1 of 4 words flagged as error
Linguistic reality: A native Hindi speaker reads both as identical — the Nukta is optional and the meaning is unchanged.

OI-WER: the lattice-based scoring protocol

The core metric powering Voice of India is Orthographically Informed Word Error Rate (OI-WER). The idea is simple: instead of comparing the ASR output to a single reference string, we compare it to a lattice of linguistically valid alternatives for that utterance.

Each audio segment in Voice of India is mapped to a comprehensive set of valid lexical and phonetic variations: alternate spellings, code-mix variants, formal vs colloquial registers, common loanword choices, Nukta-on/off forms, and equivalent numeric renderings. OI-WER then computes the optimal path through that lattice — the single reference variant that minimizes edit distance against the model's output. If the model's transcription matches any linguistically valid variant, that segment is not penalized.

VoI #22068 · Hindi · 2s

"अमिताभ बच्चन" — Amitabh Bachchan

Lattice

variations

The lattice · 6 × 8 = 48

all linguistically valid

word 16 variants

अमिताभAmitabhamitabhAMITABHAmitabamitab

word 28 variants

बच्चनBachchanbachchanBACHCHANBachanbachanBacchanbacchan

Model outputs · 12 systems

✓ 6/12Sarvam · Saarika · Gemini Pro/Flash · IndicConformer · MS STT · Amazon0% OI-WER

✗ offGPT-4o Transcribeअमिता बचनClose phonetic spelling but not in lattice

✗ offGPT-4o-mini Transcribeहम ता बच्चाHallucinated 'बच्चा' (child) — not the surname

✗ offOmniASR LLM 7B v2अम्म इधर आप अच्छे हैंHallucinated entire phrase ('are you well over here')

✗ offWhisper Large v3 TurboI m not a centEnglish hallucination

✗ offGrokI am David ChenHallucinated English persona ('I am David Chen')

WERAmitabh Bachchan → 100% · अमिताभ Bachan → 50%

OI-WERboth → 0% · only true hallucinations penalised

The lattice is not generated by an LLM judge at evaluation time. It is constructed offline, ahead of evaluation, by native-speaking linguists during a separate pipeline that runs alongside the primary six-stage transcription. This matters: it means OI-WER is fully deterministic, fully reproducible, and free of the model-drift problems that plague LLM-as-judge metrics.

The reference implementation is open source: AI4Bharat/OIWER on GitHub.

Data collection: dual-channel, unscripted, real

Voice of India does not use scripted prompts. The core dataset is built from dual-channel, unscripted telephonic conversations recorded across India. Dual-channel preserves acoustic isolation between speakers, allowing precise attribution. Unscripted preserves natural disfluencies, overlaps, code-switches, and the prosody of real Indic speech.

Most public Indic ASR benchmarks rely on read speech: speakers reading prepared sentences in quiet rooms into a clean microphone. That setting eliminates almost everything that makes real production audio hard — overlapping turns, interruptions, background TV, vehicle noise, mid-sentence code-switches into English, hesitations, laughter, and the long-tail prosody of unrehearsed speech. Models that score well on read speech routinely collapse on conversational telephony, which is exactly the channel where most Indian users actually interact with voice systems (customer support, field operations, IVR, voice assistants on entry-level devices).

We deliberately recorded over a narrowband telephonic channel to match this production reality. Each speaker is captured on an isolated channel, which means we can run speaker-attributed transcription and measure cross-talk cleanly. Conversations are recorded across 28 states and 8 union territories, spanning urban metros, tier-2 and tier-3 cities, and rural districts, so that accent, dialect, and channel conditions reflect the actual geography of Indian speech rather than a studio-controlled subset.

Raw conversations are programmatically segmented into utterances based on detected silences and speaker-turn boundaries. Every segment then passes a three-tier filtering protocol before it is considered for the benchmark:

Acoustic integrity. Signal-to-Noise Ratio thresholds ensure the speech is discernable while remaining challenging. We retain a small share of noisy audio (P808 ≤ 2.8) — typically 3–7% per language — because production systems have to handle it. Pure silence, DTMF tones, ringback, and hold music are filtered out automatically.
Cluster eligibility. The segment must come from a speaker registered against one of our 139 regional clusters, with verified demographic metadata (age band, gender, district). This guarantees that every utterance in the benchmark can be traced back to a known geographic and demographic slice, enabling fairness analysis across age, gender, and region.
Linguistic relevance. Segments are screened for primary-language content; pure-English segments and unintelligible audio are dropped before transcription. Natural intra-sentential code-switching (Hindi–English, Tamil–English, etc.) is preserved, because it is a defining feature of conversational Indic speech and a known failure mode for monolingual models.

The result is a corpus that looks much closer to what a deployed model actually encounters: short, overlapping, noisy, demographically diverse utterances drawn from real conversations rather than read scripts.

The six-stage gold-standard transcription pipeline

For every selected audio segment we run a machine-assisted, six-pass transcription and audit pipeline. The goal is near-zero ground-truth error.

Hypothesis generation. A primary transcription is produced by an ensemble of internal high-recall ASR models tuned for each of the 15 languages.
First linguist pass. A native-speaking linguist audits the hypothesis against the audio for absolute correctness.
Independent relay audit (round 1). A different linguist reviews the audio against the previous transcript without seeing prior reviewer notes.
Independent relay audit (round 2). A third linguist repeats the process. Any disagreement triggers a consensus discussion.
Variation extraction. A separate pipeline generates the lattice of valid lexical/phonetic variations for that utterance.
Lattice validation. Native-speaker reviewers approve, prune, or extend the lattice before the segment enters the benchmark.

Each segment passes through at least three independent native-speaker reviewers before it is considered gold standard. This is expensive. It is the reason the benchmark exists.

Cluster-based sampling: representing how India actually speaks

A common failure mode in Indic benchmarks is to treat every region as equal. Voice of India does the opposite. We use Population-Proportional Stratified Sampling across 139 regional clusters defined to capture distinct phonetic and lexical signatures. The volume of data sampled from each cluster is determined by the actual population percentage of that region — so the dataset's demographic weight matches the country's.

That single choice changes everything. It means a Telugu speaker from coastal Andhra and a Telugu speaker from Telangana are both represented in proportion to their real populations, instead of being averaged out into a single "Telugu" bucket. It means Hindi covers 306 districts, not three studio cities. And it means the score a model gets is closer to the score it would earn in production.

Why geographic representation actually matters

A single headline number like "Tamil WER = 20%" is one of the most misleading artifacts in Indic ASR. It collapses a country-sized distribution into a single point estimate and hides the only signal that matters for production: where the model is failing. The same model reported at 20% WER on Tamil might in reality be operating at 10% WER in Chennai and central Tamil Nadu — fluently handling the dialect it has seen most during training — while degrading to 60–80% WER in Tirunelveli, Kanyakumari, or among Sri Lankan Tamil speakers, where phonetic cadence, vowel length, and loanword inventory shift sharply. The 20% average is real, but no user actually experiences 20%. They experience either the 10% version of the model or the 80% one, depending entirely on where they live.

Population-proportional clustering is what makes this visible. By breaking each language into its real geographic and demographic strata — and weighting the dataset so each cluster contributes in proportion to its actual speaker population — we can decompose a single WER number into a per-cluster distribution. Companies deploying voice products can then see exactly which districts, age bands, or gender cohorts are being underserved, prioritize targeted data collection or fine-tuning for those clusters, and track whether quality improvements are uniform across the country or concentrated in already-strong regions. The goal is not just to lower the headline number, but to flatten the variance — so a user in Bastar gets the same product quality as a user in Bengaluru.

Per-language dataset breakdown

Language	Utterances	Duration (hrs)	Speakers	Districts	M : F	Unique Words
Assamese	13,849	23.4	390	29	60 : 40	23,792
Bengali	24,234	50.4	6,014	54	55 : 45	41,333
Bhojpuri	12,742	23.0	390	99	56 : 44	21,067
Chattisgarhi	11,508	20.8	349	67	41 : 59	23,890
Gujarati	29,105	44.8	4,634	31	56 : 44	43,006
Hindi	22,060	46.9	4,270	306	57 : 43	26,278
Kannada	12,671	24.9	320	36	48 : 52	45,642
Maithili	14,139	22.1	658	36	54 : 46	26,855
Malayalam	20,969	45.1	1,090	43	44 : 57	82,996
Marathi	24,442	48.6	4,501	34	53 : 47	46,280
Odia	15,972	22.4	390	40	61 : 39	24,609
Punjabi	17,998	32.5	1,030	58	57 : 43	32,300
Tamil	32,623	53.4	4,434	37	47 : 53	80,426
Telugu	39,223	51.7	7,833	44	55 : 45	74,199
Urdu	14,709	26.2	388	119	54 : 46	17,228

536 hours of curated audio across 306,230 utterances and 36,691 unique speakers. Hindi spans 306 districts; Bhojpuri and Urdu pull from 99 and 119 respectively.

The shape of the dataset matters as much as its size. Telugu and Bengali concentrate volume into deep speaker pools; Hindi spreads thin across 306 districts to surface regional variation; Urdu and Bhojpuri intentionally over-index on district coverage relative to their hours, because dialectal drift in those languages is the entire point of measurement.

The collection pipeline, end to end

Eliciting spontaneous Indic speech at scale is harder than collecting scripted prompts. To draw out natural, extended speech, we curated a repository of 1,000+ conversational topics per language — everyday life, finance, healthcare, agriculture, education, travel — generated initially by Gemini 3 Pro and then audited by native language experts for cultural fit. Each topic surfaces as an open-ended cue followed by progressively revealed follow-up questions that pull speakers toward richer descriptions.

Audio is captured over internal infrastructure that preserves dual-channel audio (speaker isolation), with WebRTC VAD performing initial segmentation. Adjacent speech regions are merged across short silences and length-filtered. Automated language identification (Meta MMS) flags off-language segments before they reach human review. Every contributor is screened for language familiarity before being granted recording access; all participants provide informed consent under an institute-approved ethics protocol.

Public vs private split: protecting the benchmark

For Voice of India to remain a useful long-term measurement, it has to stay uncontaminated. Once a benchmark leaks into training pipelines, its scores stop measuring generalization and start measuring memorization. The history of NLP benchmarks is largely the history of this exact failure.

So the core evaluation set is held private. No model provider sees the held-out audio or the lattice references.

We will release a public sample set for each of the 15 languages, covering the full pipeline — the audio, the gold-standard transcript, the lattice variants, and per-model outputs. The public set is large enough to debug a model and understand the lattice protocol, and small enough that it cannot meaningfully contaminate training.

The leaderboard

We evaluated 16 models against Voice of India: open-source releases from Meta, AI4Bharat, and Google, alongside proprietary APIs from OpenAI, Google, Microsoft, AWS, Deepgram, AssemblyAI, ElevenLabs, and Sarvam. The aggregate ranking is a corpus-level OI-WER computed across languages, pooled by reference word count — so models can't game the score by performing well only on smaller subsets.

Table · Per-language results

OI-WER (%) of all 16 models across 15 Indian languages on Voice of India

Model	as	bn	bho	hne	gu	hi	ka	mai	ml	mr	or	pa	ta	te	ur
ElevenLabs Scribe v2	15.6	10.0	23.5	20.3	21.2	7.7	19.3	–	23.0	12.9	20.7	15.6	20.4	23.0	25.5
Amazon Transcribe	–	9.1	36.0	32.9	17.7	6.8	18.6	–	28.2	11.3	17.9	15.9	19.3	19.7	–
AssemblyAI Universal	104.8	103.8	46.1	43.6	101.8	19.3	89.0	–	107.5	87.6	–	101.0	57.4	105.0	31.9
Deepgram Nova 3	–	28.9	45.8	42.4	–	13.0	53.7	–	–	43.7	–	–	67.8	43.1	–
Gemini 3 Pro	20.1	8.5	18.4	17.2	15.8	6.0	19.9	25.6	21.7	10.7	20.9	14.4	15.7	21.9	9.1
Gemini 3 Flash	26.9	12.6	22.6	23.9	22.5	8.3	22.2	30.8	27.1	16.0	26.1	19.4	19.9	27.9	11.9
GPT-4o Transcribe	94.7	44.8	49.0	45.3	98.2	34.0	84.2	60.5	97.0	55.6	72.5	70.1	64.2	69.3	35.4
GPT-4o Mini Transcribe	37.6	21.1	49.1	44.6	295.9	19.6	97.5	45.6	167.8	30.7	42.1	37.9	51.9	81.2	52.0
IndicConformer	14.3	10.7	–	–	18.0	8.2	21.4	24.7	25.9	13.1	14.4	14.9	19.9	23.7	8.1
Microsoft Speech-to-Text	–	25.4	38.1	34.5	–	11.4	–	–	40.9	31.9	–	–	28.0	–	25.2
OmniASR LLM 1B	29.2	29.7	32.8	27.3	38.9	14.9	45.7	47.7	58.4	31.2	89.8	36.5	49.0	57.3	17.2
OmniASR LLM 7B	25.3	22.8	31.4	26.3	34.1	13.7	39.2	48.2	52.0	26.3	72.6	33.3	43.1	50.7	16.0
Sarvam Audio	12.7	6.0	20.9	17.6	12.8	5.0	16.3	24.8	18.9	9.4	14.0	11.2	14.2	18.2	7.0
Saarika 2.5	–	8.1	29.6	26.0	14.0	6.2	16.4	–	18.9	10.0	15.1	12.4	14.9	18.9	–
Gemma E4B	46.9	21.5	31.4	29.3	30.0	11.2	34.8	–	47.6	26.6	46.4	25.3	40.4	43.8	15.8

Cells are colour-graded per column: green = best in language, red = worst. Dashes mark languages a model does not natively support — we never extrapolate. Sarvam Audio and Saarika 2.5 hold most green cells; AssemblyAI and GPT-4o sit in deep red across the board. Hindi (hi) is the only language where almost every system clears 20%, and Malayalam (ml) is where the biggest gaps open up.

Figure

OI-WER by model — Hindi

Hindi has compressed to a sub-10% race. South-Indian languages (Tamil, Malayalam, Telugu) sit 2-4× higher across the board — a data-representation ceiling, not a model-capacity one.

Drilling per-language shows where the leaderboard's averages mislead. Hindi is the most competitive language, with multiple systems under 8% OI-WER and the best near the noise floor of human inter-annotator agreement. Tamil, Malayalam, Telugu, and Kannada remain hard for almost everyone — agglutinative morphology and dialectal density push even strong proprietary systems into the high teens. Sarvam's models lead on most languages; Gemini 3 Pro is the strongest non-Indic- specialised system; open-source IndicConformer is competitive with proprietary APIs on Hindi, Bengali, and Urdu.

Model coverage is not uniform. Several proprietary systems do not support Assamese, Maithili, Odia, Punjabi, or Gujarati natively. We do not extrapolate scores for unsupported languages — a missing cell on the leaderboard is a missing capability, not a low score.

Why public benchmarks overstate ASR readiness

Public benchmarks like FLEURS are widely used to report Indic ASR performance, but their static, publicly accessible evaluation sets are vulnerable to overfitting. Models tuned to the leaderboard exploit benchmark artifacts; their scores stop measuring generalization. Voice of India, scored on held-out telephonic audio, exposes the gap.

Figure

Figure 3 — WER of six models across FLEURS and Voice of India

Public benchmarks overstate readiness. GPT-4o Transcribe ranks 5th on FLEURS (9.1%) but jumps to ~34% on Voice of India OI-WER and 40.3% on single-reference Voice of India Ground Truth — a re-ordering that exposes FLEURS overfitting. Sarvam Audio, undervalued by FLEURS, takes rank #1 on real-world Indian telephony. OI-WER recovers ~2–6 points of legitimate orthographic variation that strict single-reference Ground Truth WER penalises. Circled badges show rank at each stage (1 = lowest WER).

Table · Benchmark gap · Tamil

The same six models, scored on FLEURS vs Voice of India — Tamil

Model	FLEURS WER	FLEURS Rank	VoI OI-WER	VoI Rank	Δ WER	Rank Change
Sarvam Audio	14.2%	#4	14.2%	#1	+-0.0pp	▲ up 3
Gemini 3 Pro	11.4%	#1	15.7%	#3	+4.3pp	▼ down 2
Gemini 3 Flash	12.1%	#2	19.9%	#6	+7.8pp	▼ down 4
11Labs Scribe v2	13.8%	#3	20.4%	#7	+6.6pp	▼ down 4
GPT-4o Transcribe	18.6%	#5	64.2%	#8	+45.6pp	▼ down 3
Deepgram Nova 3	24.3%	#6	67.8%	#9	+43.5pp	▼ down 3

Tamil only — aggregate FLEURS vs VoI comparisons are not meaningful because FLEURS hours and coverage vary sharply across Indic languages. Even on a single language, every model gets worse on Voice of India and the order changes: Sarvam Audio, undervalued by FLEURS at #4, emerges as the clear leader on real Tamil telephony. A public benchmark cannot reliably tell you which Indic ASR system to deploy.

Table · Benchmark properties

Why Voice of India is a stricter — and more honest — measurement than FLEURS

Dimension	FLEURS (public)	Voice of India
Audio source	Read prompts from Wikipedia, studio-quality	Unscripted telephonic conversations on real mobile networks
Speaking style	Scripted, single-speaker, no disfluencies	Spontaneous, code-switched, natural disfluencies and overlaps
Geographic coverage	No district-level annotation	139 regional clusters · 675 districts · population-proportional sampling
Reference transcripts	Single reference per utterance	Lattice of valid orthographic / code-mix / register variants
Set visibility	Fully public — vulnerable to training contamination	Private held-out core set + small public sample
Language coverage	Some Indic languages, uneven hours	15 Indic languages, 536 hours, 36,691 speakers

FLEURS was a useful baseline for cross-lingual coverage in 2022. For Indian-language production deployments in 2026, every dimension above matters more than aggregate WER — and FLEURS only reports aggregate WER on a public set everyone has trained against.

The reordering matters as much as the absolute jump. On FLEURS, Deepgram Nova 3 (16.7%) and GPT-4o Transcribe (9.1%) appear competitive. On VoI, both fall to the bottom of the leaderboard while Sarvam Audio — undervalued by FLEURS at 8.2% — emerges as the actual leader at 13.0% lattice WER. Single-reference WER on the same audio inflates errors further (15.4% for Sarvam, 40.3% for GPT-4o), confirming that a meaningful slice of the “errors” older metrics flag are valid orthographic variations, not transcription failures.

Where ASR breaks: audio quality and speaking rate

Aggregate WER hides the conditions under which models actually fail. We sliced the evaluation set across DNSMOS quality quartiles and speaking-rate buckets to see what production teams should actually expect.

Figure

WER vs audio quality (DNSMOS quartiles)

Across every system, error rate drops monotonically as recording quality improves. The Q1→Q4 gap is roughly 2× — production deployments should expect their worst-quality decile to dominate aggregate error.

Across every system, error rate falls monotonically as recording quality improves — with roughly a 2× gap between the worst quartile (Q1) and the best (Q4). For production deployments where audio quality is uncontrolled, the worst-quality decile of calls will dominate aggregate WER.

Figure

WER vs speaking rate (words per second)

A clear U-curve: very slow speech (often hesitant, disfluent) and very fast speech both increase error rates. The optimum sits in the middle of the 'normal' band — a 4-5 wps sweet spot for telephonic Indic audio.

Speaking rate produces a clean U-curve. Both very slow speech — typically hesitant, disfluent, or pause-heavy — and very fast speech increase error rates noticeably. The optimum sits inside the “normal” band at roughly 4-5 words per second. Anything outside that window costs accuracy, regardless of vendor.

Audio duration shows a similar shape: very short (<2s) clips lack acoustic context for disambiguation, and very long (>5s) clips amplify cumulative drift. The 2-5 second sweet spot mirrors what production VAD pipelines tend to produce anyway.

Geographic disparity: a 4% to 44% spread

Reporting one number per language averages out the most important signal: where in India a system fails. We computed average WER per district across four pan-India models supporting all fifteen languages.

Figure

District-level WER spans 4% to 44%

Illustrative district-level averages across four pan-India models, drawn from the paper's regional aggregates. Hindi-belt metros cluster under 10%; the worst errors come from Bhojpuri-speaking eastern UP and Bihar districts (Gorakhpur, Chhapra) — Hindi-dominant regions whose dialect tail is severely underrepresented in training data.

District-level WER ranges from ~4% (Nainital, in the Himalayan Hindi belt) to ~44% (Mannarakkat in interior Kerala). The Hindi belt — Uttar Pradesh, Delhi, Haryana, Rajasthan, Madhya Pradesh — clusters under 10% WER. Metropolitan districts also tend to perform well. Linguistically diverse or underrepresented regions, especially North Bihar (Maithili, Bhojpuri) and interior Kerala (Malayalam), exhibit substantially higher error rates. The pattern is a clear geographic bias and a direct consequence of training-data distribution, not modeling capacity.

For out-of-region migrants, the picture is worse. A Chattisgarhi speaker calling from Tamil Nadu — a real, common production scenario — sees WERs of 55-65% across most systems. ASR robustness is not a property of the language; it is a property of the speaker-and-region pair.

The 19-21% male-speaker penalty

One of the most striking findings in the per-attribute analysis is a consistent gender gap. Across architectures, vendors, and languages, male speakers see roughly 20% relative WER inflation versus female speakers on the same content.

Figure

A consistent 19–21% male-speaker penalty

Across architectures and vendors, male speakers see ~20% relative WER inflation versus female speakers on identical content — likely a downstream effect of female-skewed training corpora in Indian language pretraining sets.

The effect is too consistent to be coincidence and too large to ignore. The most likely cause is upstream: female-skewed Indic training corpora over the last several years have produced acoustic models that fit female pitch ranges and articulation patterns more tightly. Closing this gap requires deliberate balancing in pretraining audio, not architectural changes.

What the numbers say

Five patterns are worth flagging:

The Hindi gap is closing fast — the South is not. Best-in-class OI-WER on Hindi is now under 5%, near the noise floor of human inter-annotator agreement for spontaneous telephonic speech. The same trajectory is not visible in Tamil, Malayalam, Telugu, or Kannada, where OI-WER stays 2–4× higher across almost every system and every architecture. The South-Indian ceiling is structural, not a tuning problem.
Indian ASR is not "solved." The popular narrative — driven by single-digit WER on a handful of public leaderboards — overstates how usable these models actually are in production. The same systems that look saturated on public benchmarks degrade sharply on unseen speakers, unseen districts, and natural code-switched speech. Companies betting deployment decisions on those leaderboards are buying a number, not a capability.
Public benchmarks do not reflect real-world deployment. Most open evaluation sets have been around long enough to have leaked into pretraining corpora, been overfit on, or been tuned against. They cover a narrow slice of speakers, channels, and domains, and they do not surface the edge cases — telephony noise, accent drift, named entities, numerals, code-switches — that dominate failures in the field. A model that wins a public benchmark and a model that survives a call-center floor are not the same model.
After a point, geography is the lever — not architecture. Once a model is broadly pretrained on a language, further gains stop coming from bigger encoders or fancier objectives. They come from going to specific districts, recording the dialect that actually lives there, and folding that audio back into training. District-level WER (Chhapra at 44%, Gorakhpur at 38.5%, vs. Nainital at 4%) is the signal that tells deployment teams where the next percentage point of real-world quality has to come from.
Single-reference scoring unfairly penalizes Indian languages.Standard WER assumes one canonical transcript. Indian languages routinely admit multiple equally valid spellings, sandhi joins, transliterations, and morphological variants for the same utterance. Scoring against one reference treats legitimate variation as error. A closed set of accepted lattice variations per utterance — the approach OI-WER takes — is what makes a benchmark honest about Indian speech instead of artificially harsh on it.

What's next

Voice of India is a living benchmark. The roadmap focuses on the failure modes that still hide inside aggregate OI-WER:

Named-entity coverage. Dedicated evaluation of place names, person names, and organization names — the parts of a transcript that carry the most downstream weight in production systems.
Domain-specific evaluation sets. Healthcare, finance, education, governance, and public-services audio, scored separately so deployment teams can see ASR readiness in their actual domain.
Numerical and semiotic segmentation. Finer-grained scoring for dates, currencies, measurements, and percentages — the categories where ASR failures most often translate into business failures.
Expanded language coverage. More languages, more dialect clusters, and longer-form audio.

Acknowledgements

Voice of India exists because of the native-speaking transcribers, reviewers, and linguists across all 15 languages who did the slow, careful work that no automated pipeline can replace. It exists because of AI4Bharat and IIT Madras, who lent decades of Indic NLP research to the design of OI-WER. And it exists because we think the conversation about Indian-language AI deserves a measurement that India can recognize as fair.

To request an evaluation against the private set, write to voi-evals@joshtalks.com.