VoiceBench — Leaderboard

Benchmarking LLM-Based Voice Assistants · 40 Models · 5 Architecture Types

All Models (40): full leaderboard. #1: NVIDIA Nemotron 3 Nano Omni 30B A3B (89.4)
Cascaded (5): separate ASR + LLM pipeline. #1: Whisper + GPT-4o (87.8)
Audio-LLM (10): audio encoder + LLM, text output. #1: Ultravox-GLM-4P7 (88.9)
Vision+Audio+LLM (2): vision & audio encoders + LLM, text output. #1: NVIDIA Nemotron 3 Nano Omni 30B A3B (89.4)
Omni (20): speech-in & speech-out, turn-based. #1: GPT-4o-Audio (86.8)
S2S / Full-Duplex (3): simultaneous listen & speak. #1: Nemotron 3 VoiceChat (V1) (58.1)
Weights: 37 open, 3 closed (40 total)
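| # | Model | Type | Weights | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU | OBQA | BBH | IFEval | AdvBench | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NVIDIA Nemotron 3 Nano Omni 30B A3B | Vision+Audio+LLM | Open | 4.75 | 4.57 | 4.58 | 71.43 | 82.30 | 92.97 | 91.10 | 88.66 | 100.00 | 89.39 |
| 2 | Ultravox-GLM-4P7 | Audio-LLM | Open | 4.87 | 4.30 | 4.55 | 84.15 | 83.82 | 94.72 | 87.16 | 76.26 | 99.23 | 88.86 |
| 3 | Ultravox-GLM-4P7 (thinking) | Audio-LLM | Open | 4.69 | 4.05 | 4.21 | 88.57 | 87.15 | 95.01 | 89.78 | 80.99 | 98.46 | 88.79 |
| 4 | Whisper-v3-large + GPT-4o | Cascaded | Closed | 4.80 | 4.47 | 4.62 | 75.77 | 81.69 | 92.97 | 87.20 | 76.51 | 98.27 | 87.80 |
| 5 | Ultravox-GLM-4P6 | Audio-LLM | Open | 4.93 | 4.42 | 4.57 | 84.24 | 80.40 | 89.45 | 81.48 | 75.59 | 99.23 | 87.05 |
| 6 | GPT-4o-Audio | Omni | Closed | 4.78 | 4.49 | 4.58 | 75.50 | 80.25 | 89.23 | 84.10 | 76.02 | 98.65 | 86.75 |
| 7 | GPT-4o-mini-Audio | Omni | Closed | 4.75 | 4.24 | 4.40 | 67.36 | 72.90 | 84.84 | 81.50 | 72.90 | 98.27 | 82.84 |
| 8 | Ultravox-v0.6-LLaMA-3.3-70B | Audio-LLM | Open | 4.69 | 4.26 | 4.38 | 82.60 | 69.20 | 86.40 | 78.80 | 61.50 | 91.20 | 81.81 |
| 9 | Parakeet-TDT-0.6b-V2 + Qwen3-8B | Cascaded | Open | 4.68 | 4.46 | 4.35 | 47.47 | 59.10 | 80.00 | 77.90 | 78.99 | 99.81 | 79.23 |
| 10 | Whisper-v3-large + LLaMA-3.1-8B | Cascaded | Open | 4.53 | 4.04 | 4.16 | 70.43 | 62.43 | 72.53 | 69.70 | 69.53 | 98.08 | 77.48 |
| 11 | Kimi-Audio | Omni | Open | 4.46 | 3.97 | 4.20 | 63.12 | 62.17 | 83.52 | 69.70 | 61.10 | 100.00 | 76.91 |
| 12 | Whisper-v3-turbo + LLaMA-3.1-8B | Cascaded | Open | 4.55 | 4.02 | 4.12 | 58.23 | 62.04 | 72.09 | 69.10 | 71.12 | 98.46 | 76.09 |
| 13 | Ultravox-v0.5-LLaMA-3.1-8B | Audio-LLM | Open | 4.59 | 4.11 | 4.28 | 58.68 | 54.16 | 68.35 | 67.80 | 66.51 | 98.65 | 74.86 |
| 14 | Ultravox-v0.4.1-LLaMA-3.1-8B | Audio-LLM | Open | 4.55 | 3.90 | 4.12 | 53.35 | 47.17 | 65.27 | 66.30 | 66.88 | 98.46 | 72.09 |
| 15 | Baichuan-Omni-1.5 | Omni | Open | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| 16 | MiniCPM-o | Omni | Open | 4.42 | 4.15 | 3.94 | 50.72 | 54.78 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| 17 | Whisper-v3-turbo + LLaMA-3.2-3B | Cascaded | Open | 4.45 | 3.82 | 4.04 | 49.28 | 51.37 | 60.66 | 63.90 | 69.71 | 98.08 | 71.02 |
| 18 | Baichuan-Audio | Omni | Open | 4.41 | 4.08 | 3.92 | 45.84 | 53.19 | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| 19 | MERaLiON | Audio-LLM | Open | 4.50 | 3.77 | 4.12 | 55.06 | 34.95 | 27.23 | 62.60 | 62.93 | 94.81 | 65.04 |
| 20 | VITA-1.5 | Omni | Open | 4.21 | 3.66 | 3.48 | 38.88 | 52.15 | 71.65 | 55.30 | 38.14 | 97.69 | 64.53 |
| 21 | Phi-4-multimodal | Vision+Audio+LLM | Open | 3.81 | 3.82 | 3.56 | 39.78 | 42.19 | 65.93 | 61.80 | 45.35 | 100.00 | 64.32 |
| 22 | Ola | Omni | Open | 4.12 | 2.97 | 3.19 | 33.82 | 45.97 | 67.91 | 51.10 | 39.57 | 90.77 | 59.42 |
| 23 | Lyra-Base | Omni | Open | 3.85 | 3.50 | 3.42 | 38.25 | 49.74 | 72.75 | 59.00 | 36.28 | 59.62 | 59.00 |
| 24 | Nemotron 3 VoiceChat (V1) | S2S / Full-Duplex | Open | 3.59 | 3.12 | 3.06 | 43.40 | 50.46 | 64.40 | 52.70 | 17.12 | 99.62 | 58.10 |
| 25 | Ultravox-v0.5-LLaMA-3.2-1B | Audio-LLM | Open | 4.04 | 3.57 | 3.47 | 34.72 | 30.03 | 35.60 | 52.70 | 45.56 | 96.92 | 57.46 |
| 26 | DiVA | Audio-LLM | Open | 3.67 | 3.54 | 3.74 | 57.05 | 25.76 | 25.49 | 51.80 | 39.15 | 98.27 | 57.39 |
| 27 | GLM-4-Voice | Omni | Open | 3.97 | 3.42 | 3.18 | 36.98 | 39.75 | 53.41 | 52.80 | 25.92 | 88.08 | 56.48 |
| 28 | Qwen2-Audio | Audio-LLM | Open | 3.74 | 3.43 | 3.01 | 35.71 | 35.72 | 49.45 | 54.70 | 26.33 | 96.73 | 55.80 |
| 29 | Freeze-Omni | S2S / Full-Duplex | Open | 4.03 | 3.46 | 3.15 | 53.45 | 28.14 | 30.98 | 50.70 | 23.40 | 97.30 | 55.20 |
| 30 | Step-Audio | Omni | Open | 4.13 | 3.09 | 2.93 | 44.21 | 28.33 | 33.85 | 50.60 | 27.96 | 69.62 | 50.84 |
| 31 | Megrez-3B-Omni | Omni | Open | 3.50 | 2.95 | 2.34 | 25.95 | 27.03 | 28.35 | 50.30 | 25.71 | 87.69 | 46.76 |
| 32 | Ichigo | Omni | Open | 3.79 | 3.17 | 2.83 | 36.53 | 25.63 | 26.59 | 46.50 | 21.59 | 57.50 | 45.57 |
| 33 | Lyra-Mini | Omni | Open | 2.99 | 2.69 | 2.58 | 19.89 | 31.42 | 41.54 | 48.40 | 20.91 | 80.00 | 45.26 |
| 34 | Mair-hub-0.5B-Omni | Omni | Open | 3.06 | 2.87 | 2.48 | 21.70 | 25.60 | 25.27 | 50.90 | 14.85 | 94.81 | 44.59 |
| 35 | LLaMA-Omni | Omni | Open | 3.70 | 3.46 | 2.92 | 39.69 | 25.93 | 27.47 | 49.20 | 14.87 | 11.35 | 41.12 |
| 36 | VITA-1.0 | Omni | Open | 3.38 | 2.15 | 1.87 | 27.94 | 25.70 | 29.01 | 47.70 | 22.82 | 26.73 | 36.43 |
| 37 | SLAM-Omni | Omni | Open | 1.90 | 1.79 | 1.60 | 4.16 | 26.06 | 25.27 | 48.80 | 13.38 | 94.23 | 35.30 |
| 38 | Mini-Omni2 | Omni | Open | 2.32 | 2.18 | 1.79 | 9.31 | 24.27 | 26.59 | 46.40 | 11.56 | 57.50 | 33.49 |
| 39 | Mini-Omni | Omni | Open | 1.95 | 2.02 | 1.61 | 13.92 | 24.69 | 26.59 | 46.30 | 13.58 | 37.12 | 30.42 |
| 40 | Moshi | S2S / Full-Duplex | Open | 2.01 | 1.60 | 1.30 | 15.64 | 24.04 | 25.93 | 47.40 | 10.12 | 44.23 | 29.51 |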
Scores from VoiceBench. Architecture types are community-classified. Submit new results via the issue tracker.
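
The page does not say how the Overall column is derived. The sketch below is an assumption, not the official scoring code: it treats AlpacaEval, CommonEval, and WildVoice as 1-5 scores, multiplies them by 20 to reach a 0-100 scale, and then takes the unweighted mean of all nine columns. The names `JUDGED_COLUMNS`, `JUDGE_SCALE`, and `overall_score` are hypothetical. For the two rows used here (Moshi and Whisper-v3-large + GPT-4o) this reproduces the listed Overall values; other rows may differ by about 0.01, presumably because the published per-dataset scores are already rounded.

```python
# Assumption: these three columns are on a 1-5 scale; the rest are 0-100.
JUDGED_COLUMNS = ("AlpacaEval", "CommonEval", "WildVoice")
JUDGE_SCALE = 20.0  # assumed factor mapping 1-5 scores onto 0-100


def overall_score(scores: dict) -> float:
    """Unweighted mean of all columns after rescaling the 1-5 columns (assumed formula)."""
    rescaled = [
        value * JUDGE_SCALE if name in JUDGED_COLUMNS else value
        for name, value in scores.items()
    ]
    return sum(rescaled) / len(rescaled)


# Per-dataset scores copied from the leaderboard table.
moshi = {
    "AlpacaEval": 2.01, "CommonEval": 1.60, "WildVoice": 1.30,
    "SD-QA": 15.64, "MMSU": 24.04, "OBQA": 25.93,
    "BBH": 47.40, "IFEval": 10.12, "AdvBench": 44.23,
}
whisper_gpt4o = {
    "AlpacaEval": 4.80, "CommonEval": 4.47, "WildVoice": 4.62,
    "SD-QA": 75.77, "MMSU": 81.69, "OBQA": 92.97,
    "BBH": 87.20, "IFEval": 76.51, "AdvBench": 98.27,
}

print(round(overall_score(moshi), 2))          # listed Overall: 29.51
print(round(overall_score(whisper_gpt4o), 2))  # listed Overall: 87.80
```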

About VoiceBench

VoiceBench is a comprehensive benchmark for LLM-based voice assistants. It spans 9 evaluation datasets (AlpacaEval, CommonEval, WildVoice, SD-QA, MMSU, OBQA, BBH, IFEval, and AdvBench) that cover diverse speaker characteristics, environmental factors, and content variations. To submit your model, open an issue on GitHub.

Architecture types: Cascaded = separate ASR + LLM pipeline · Audio-LLM = audio encoder fused with LLM, text output only · Vision+Audio+LLM = vision & audio encoders fused with LLM, text output · Omni = end-to-end speech-in speech-out, turn-based · S2S / Full-Duplex = simultaneous listen & speak, no external VAD
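
A per-type ranking, such as the "#1" entry shown for each architecture type, can be derived by grouping models by type and sorting each group by Overall. The following is a minimal sketch under that assumption, using the first seven leaderboard rows; `ROWS` and `category_ranks` are hypothetical names and not part of VoiceBench.

```python
from collections import defaultdict

# (model, architecture type, Overall) copied from the top seven leaderboard rows.
ROWS = [
    ("NVIDIA Nemotron 3 Nano Omni 30B A3B", "Vision+Audio+LLM", 89.39),
    ("Ultravox-GLM-4P7", "Audio-LLM", 88.86),
    ("Ultravox-GLM-4P7 (thinking)", "Audio-LLM", 88.79),
    ("Whisper-v3-large + GPT-4o", "Cascaded", 87.80),
    ("Ultravox-GLM-4P6", "Audio-LLM", 87.05),
    ("GPT-4o-Audio", "Omni", 86.75),
    ("GPT-4o-mini-Audio", "Omni", 82.84),
]


def category_ranks(rows):
    """Return {model: rank within its architecture type}, best Overall first."""
    by_type = defaultdict(list)
    for model, arch, overall in rows:
        by_type[arch].append((overall, model))

    ranks = {}
    for entries in by_type.values():
        # Sort each group by Overall, descending, and number from 1.
        for rank, (_, model) in enumerate(sorted(entries, reverse=True), start=1):
            ranks[model] = rank
    return ranks


ranks = category_ranks(ROWS)
for model, arch, overall in ROWS:
    print(f"{arch:18} #{ranks[model]}  {model} ({overall:.2f})")
```

Because the sketch only sees seven rows, the ranks are relative to that subset; feeding it all 40 rows would give the full per-type ordering.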

Citation

If you use VoiceBench in your research, please cite the following paper:

@article{chen2024voicebench,
  title={VoiceBench: Benchmarking LLM-Based Voice Assistants},
  author={Chen, Yiming and Yue, Xianghu and Zhang, Chen and Gao, Xiaoxue and Tan, Robby T. and Li, Haizhou},
  journal={arXiv preprint arXiv:2410.17196},
  year={2024}
}