VoiceBench Leaderboard — Interactive Voice Assistant Rankings

All Models

Full leaderboard

#1: NVIDIA Nemotron 3 Nano Omni 30B A3B (89.4)

Cascaded

Separate ASR + LLM pipeline

#1: Whisper + GPT-4o (87.8)

Audio-LLM

Audio encoder + LLM, text output

#1: Ultravox-GLM-4P7 (88.9)

Vision+Audio+LLM

Vision & audio encoders + LLM, text output

#1: NVIDIA Nemotron 3 Nano Omni 30B A3B (89.4)

Omni

Speech-in & speech-out, turn-based

#1: GPT-4o-Audio (86.8)

S2S / Full-Duplex

Simultaneous listen & speak

#1: Nemotron 3 VoiceChat (V1) (58.1)

Weights: All (41) Open (38) Closed (3)

#	Model	Type	Weights	AlpacaEval	CommonEval	WildVoice	SD-QA	MMSU	OBQA	BBH	IFEval	AdvBench	Overall	Cat. Rank
1	NVIDIA Nemotron 3 Nano Omni 30B A3B	Vision+Audio+LLM	Open	4.75	4.57	4.58	71.43	82.30	92.97	91.10	88.66	100.00	89.39	—
2	Ultravox-GLM-4P7	Audio-LLM	Open	4.87	4.30	4.55	84.15	83.82	94.72	87.16	76.26	99.23	88.86	—
3	Ultravox-GLM-4P7 (thinking)	Audio-LLM	Open	4.69	4.05	4.21	88.57	87.15	95.01	89.78	80.99	98.46	88.79	—
4	Whisper-v3-large + GPT-4o	Cascaded	Closed	4.80	4.47	4.62	75.77	81.69	92.97	87.20	76.51	98.27	87.80	—
5	Ultravox-GLM-4P6	Audio-LLM	Open	4.93	4.42	4.57	84.24	80.40	89.45	81.48	75.59	99.23	87.05	—
6	GPT-4o-Audio	Omni	Closed	4.78	4.49	4.58	75.50	80.25	89.23	84.10	76.02	98.65	86.75	—
7	GPT-4o-mini-Audio	Omni	Closed	4.75	4.24	4.40	67.36	72.90	84.84	81.50	72.90	98.27	82.84	—
8	Ultravox-v0.6-LLaMA-3.3-70B	Audio-LLM	Open	4.69	4.26	4.38	82.60	69.20	86.40	78.80	61.50	91.20	81.81	—
9	LFG-1	Audio-LLM	Open	4.60	3.71	4.01	62.39	75.60	82.42	83.90	74.85	91.15	79.63	—
10	Parakeet-TDT-0.6b-V2 + Qwen3-8B	Cascaded	Open	4.68	4.46	4.35	47.47	59.10	80.00	77.90	78.99	99.81	79.23	—
11	Whisper-v3-large + LLaMA-3.1-8B	Cascaded	Open	4.53	4.04	4.16	70.43	62.43	72.53	69.70	69.53	98.08	77.48	—
12	Kimi-Audio	Omni	Open	4.46	3.97	4.20	63.12	62.17	83.52	69.70	61.10	100.00	76.91	—
13	Whisper-v3-turbo + LLaMA-3.1-8B	Cascaded	Open	4.55	4.02	4.12	58.23	62.04	72.09	69.10	71.12	98.46	76.09	—
14	Ultravox-v0.5-LLaMA-3.1-8B	Audio-LLM	Open	4.59	4.11	4.28	58.68	54.16	68.35	67.80	66.51	98.65	74.86	—
15	Ultravox-v0.4.1-LLaMA-3.1-8B	Audio-LLM	Open	4.55	3.90	4.12	53.35	47.17	65.27	66.30	66.88	98.46	72.09	—
16	Baichuan-Omni-1.5	Omni	Open	4.50	4.05	4.06	43.40	57.25	74.51	62.70	54.54	97.31	71.32	—
17	MiniCPM-o	Omni	Open	4.42	4.15	3.94	50.72	54.78	78.02	60.40	49.25	97.69	71.23	—
18	Whisper-v3-turbo + LLaMA-3.2-3B	Cascaded	Open	4.45	3.82	4.04	49.28	51.37	60.66	63.90	69.71	98.08	71.02	—
19	Baichuan-Audio	Omni	Open	4.41	4.08	3.92	45.84	53.19	71.65	54.80	50.31	99.42	69.27	—
20	MERaLiON	Audio-LLM	Open	4.50	3.77	4.12	55.06	34.95	27.23	62.60	62.93	94.81	65.04	—
21	VITA-1.5	Omni	Open	4.21	3.66	3.48	38.88	52.15	71.65	55.30	38.14	97.69	64.53	—
22	Phi-4-multimodal	Vision+Audio+LLM	Open	3.81	3.82	3.56	39.78	42.19	65.93	61.80	45.35	100.00	64.32	—
23	Ola	Omni	Open	4.12	2.97	3.19	33.82	45.97	67.91	51.10	39.57	90.77	59.42	—
24	Lyra-Base	Omni	Open	3.85	3.50	3.42	38.25	49.74	72.75	59.00	36.28	59.62	59.00	—
25	Nemotron 3 VoiceChat (V1)	S2S / Full-Duplex	Open	3.59	3.12	3.06	43.40	50.46	64.40	52.70	17.12	99.62	58.10	—
26	Ultravox-v0.5-LLaMA-3.2-1B	Audio-LLM	Open	4.04	3.57	3.47	34.72	30.03	35.60	52.70	45.56	96.92	57.46	—
27	DiVA	Audio-LLM	Open	3.67	3.54	3.74	57.05	25.76	25.49	51.80	39.15	98.27	57.39	—
28	GLM-4-Voice	Omni	Open	3.97	3.42	3.18	36.98	39.75	53.41	52.80	25.92	88.08	56.48	—
29	Qwen2-Audio	Audio-LLM	Open	3.74	3.43	3.01	35.71	35.72	49.45	54.70	26.33	96.73	55.80	—
30	Freeze-Omni	S2S / Full-Duplex	Open	4.03	3.46	3.15	53.45	28.14	30.98	50.70	23.40	97.30	55.20	—
31	Step-Audio	Omni	Open	4.13	3.09	2.93	44.21	28.33	33.85	50.60	27.96	69.62	50.84	—
32	Megrez-3B-Omni	Omni	Open	3.50	2.95	2.34	25.95	27.03	28.35	50.30	25.71	87.69	46.76	—
33	Ichigo	Omni	Open	3.79	3.17	2.83	36.53	25.63	26.59	46.50	21.59	57.50	45.57	—
34	Lyra-Mini	Omni	Open	2.99	2.69	2.58	19.89	31.42	41.54	48.40	20.91	80.00	45.26	—
35	Mair-hub-0.5B-Omni	Omni	Open	3.06	2.87	2.48	21.70	25.60	25.27	50.90	14.85	94.81	44.59	—
36	LLaMA-Omni	Omni	Open	3.70	3.46	2.92	39.69	25.93	27.47	49.20	14.87	11.35	41.12	—
37	VITA-1.0	Omni	Open	3.38	2.15	1.87	27.94	25.70	29.01	47.70	22.82	26.73	36.43	—
38	SLAM-Omni	Omni	Open	1.90	1.79	1.60	4.16	26.06	25.27	48.80	13.38	94.23	35.30	—
39	Mini-Omni2	Omni	Open	2.32	2.18	1.79	9.31	24.27	26.59	46.40	11.56	57.50	33.49	—
40	Mini-Omni	Omni	Open	1.95	2.02	1.61	13.92	24.69	26.59	46.30	13.58	37.12	30.42	—
41	Moshi	S2S / Full-Duplex	Open	2.01	1.60	1.30	15.64	24.04	25.93	47.40	10.12	44.23	29.51	—

Scores from VoiceBench. Architecture types are community-classified. Submit new results via the issue tracker.
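
The page does not say how the Overall column is derived. The listed values are consistent with a plain mean over the nine datasets after multiplying the three columns that appear to be on a 1-5 scale (AlpacaEval, CommonEval, WildVoice) by 20. The sketch below checks that reading against two rows of the table. The scaling rule and the names FIVE_POINT and overall are assumptions made for illustration, not something the leaderboard documents.

```python
# Assumption: AlpacaEval, CommonEval and WildVoice are on a 1-5 scale and are
# multiplied by 20 before averaging; the other six columns are already 0-100.
FIVE_POINT = ("AlpacaEval", "CommonEval", "WildVoice")


def overall(scores: dict[str, float]) -> float:
    """Mean of the nine dataset scores after rescaling the 1-5 columns."""
    scaled = [v * 20 if name in FIVE_POINT else v for name, v in scores.items()]
    return round(sum(scaled) / len(scaled), 2)


# Rows copied from the table above.
moshi = {
    "AlpacaEval": 2.01, "CommonEval": 1.60, "WildVoice": 1.30,
    "SD-QA": 15.64, "MMSU": 24.04, "OBQA": 25.93,
    "BBH": 47.40, "IFEval": 10.12, "AdvBench": 44.23,
}
whisper_gpt4o = {
    "AlpacaEval": 4.80, "CommonEval": 4.47, "WildVoice": 4.62,
    "SD-QA": 75.77, "MMSU": 81.69, "OBQA": 92.97,
    "BBH": 87.20, "IFEval": 76.51, "AdvBench": 98.27,
}

print(overall(moshi))          # table lists 29.51
print(overall(whisper_gpt4o))  # table lists 87.80
```

Some rows can differ from the published Overall by about 0.01, presumably because the table shows per-dataset scores already rounded to two decimals.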

About VoiceBench

VoiceBench is a comprehensive benchmark for LLM-based voice assistants. It spans 9 evaluation datasets that cover diverse speaker characteristics, environmental factors, and content variations. To submit your model, open an issue on GitHub.

Architecture types:
Cascaded = separate ASR + LLM pipeline
Audio-LLM = audio encoder fused with LLM, text output only
Vision+Audio+LLM = vision & audio encoders fused with LLM, text output
Omni = end-to-end speech-in speech-out, turn-based
S2S / Full-Duplex = simultaneous listen & speak, no external VAD
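
The per-category "#1" entries at the top of the page can be reproduced by grouping rows by architecture type and keeping the highest Overall score in each group. The following is a minimal sketch over a handful of rows copied from the table. The function name category_leaders and the tuple layout are hypothetical, chosen only for this example.

```python
# (model, architecture type, Overall) for a subset of the leaderboard rows.
rows = [
    ("NVIDIA Nemotron 3 Nano Omni 30B A3B", "Vision+Audio+LLM", 89.39),
    ("Ultravox-GLM-4P7", "Audio-LLM", 88.86),
    ("Whisper-v3-large + GPT-4o", "Cascaded", 87.80),
    ("GPT-4o-Audio", "Omni", 86.75),
    ("Kimi-Audio", "Omni", 76.91),
    ("Phi-4-multimodal", "Vision+Audio+LLM", 64.32),
    ("Nemotron 3 VoiceChat (V1)", "S2S / Full-Duplex", 58.10),
    ("Freeze-Omni", "S2S / Full-Duplex", 55.20),
]


def category_leaders(rows):
    """Return {architecture type: (model, overall)} for the top model per type."""
    leaders = {}
    for model, arch, score in rows:
        # Keep the row only if it beats the current best for its type.
        if arch not in leaders or score > leaders[arch][1]:
            leaders[arch] = (model, score)
    return leaders


for arch, (model, score) in category_leaders(rows).items():
    print(f"{arch}: {model} ({score:.1f})")
```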

Citation

If you use VoiceBench in your research, please cite the following paper:

@article{chen2024voicebench,
  title={VoiceBench: Benchmarking LLM-Based Voice Assistants},
  author={Chen, Yiming and Yue, Xianghu and Zhang, Chen and Gao, Xiaoxue and Tan, Robby T. and Li, Haizhou},
  journal={arXiv preprint arXiv:2410.17196},
  year={2024}
}