VoiceBench — Leaderboard

Benchmarking LLM-Based Voice Assistants · 38 Models · 4 Architecture Types

- All Models (38) · full leaderboard · #1: Ultravox-GLM-4P7 (88.9)
- Cascaded (5) · separate ASR + LLM pipeline · #1: Whisper + GPT-4o (87.8)
- Audio-LLM (11) · audio encoder + LLM, text output · #1: Ultravox-GLM-4P7 (88.9)
- Omni (20) · speech-in & speech-out, turn-based · #1: GPT-4o-Audio (86.8)
- S2S / Full-Duplex (2) · simultaneous listen & speak · #1: Freeze-Omni (55.2)
Weights: 35 open · 3 closed
| # | Model | Type | Weights | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU | OBQA | BBH | IFEval | AdvBench | Overall |
|---|-------|------|---------|------------|------------|-----------|-------|------|------|-----|--------|----------|---------|
| 1 | Ultravox-GLM-4P7 | Audio-LLM | Open | 4.87 | 4.30 | 4.55 | 84.15 | 83.82 | 94.72 | 87.16 | 76.26 | 99.23 | 88.86 |
| 2 | Ultravox-GLM-4P7 (thinking) | Audio-LLM | Open | 4.69 | 4.05 | 4.21 | 88.57 | 87.15 | 95.01 | 89.78 | 80.99 | 98.46 | 88.79 |
| 3 | Whisper-v3-large + GPT-4o | Cascaded | Closed | 4.80 | 4.47 | 4.62 | 75.77 | 81.69 | 92.97 | 87.20 | 76.51 | 98.27 | 87.80 |
| 4 | Ultravox-GLM-4P6 | Audio-LLM | Open | 4.93 | 4.42 | 4.57 | 84.24 | 80.40 | 89.45 | 81.48 | 75.59 | 99.23 | 87.05 |
| 5 | GPT-4o-Audio | Omni | Closed | 4.78 | 4.49 | 4.58 | 75.50 | 80.25 | 89.23 | 84.10 | 76.02 | 98.65 | 86.75 |
| 6 | GPT-4o-mini-Audio | Omni | Closed | 4.75 | 4.24 | 4.40 | 67.36 | 72.90 | 84.84 | 81.50 | 72.90 | 98.27 | 82.84 |
| 7 | Ultravox-v0.6-LLaMA-3.3-70B | Audio-LLM | Open | 4.69 | 4.26 | 4.38 | 82.60 | 69.20 | 86.40 | 78.80 | 61.50 | 91.20 | 81.81 |
| 8 | Parakeet-TDT-0.6b-V2 + Qwen3-8B | Cascaded | Open | 4.68 | 4.46 | 4.35 | 47.47 | 59.10 | 80.00 | 77.90 | 78.99 | 99.81 | 79.23 |
| 9 | Whisper-v3-large + LLaMA-3.1-8B | Cascaded | Open | 4.53 | 4.04 | 4.16 | 70.43 | 62.43 | 72.53 | 69.70 | 69.53 | 98.08 | 77.48 |
| 10 | Kimi-Audio | Omni | Open | 4.46 | 3.97 | 4.20 | 63.12 | 62.17 | 83.52 | 69.70 | 61.10 | 100.00 | 76.91 |
| 11 | Whisper-v3-turbo + LLaMA-3.1-8B | Cascaded | Open | 4.55 | 4.02 | 4.12 | 58.23 | 62.04 | 72.09 | 69.10 | 71.12 | 98.46 | 76.09 |
| 12 | Ultravox-v0.5-LLaMA-3.1-8B | Audio-LLM | Open | 4.59 | 4.11 | 4.28 | 58.68 | 54.16 | 68.35 | 67.80 | 66.51 | 98.65 | 74.86 |
| 13 | Ultravox-v0.4.1-LLaMA-3.1-8B | Audio-LLM | Open | 4.55 | 3.90 | 4.12 | 53.35 | 47.17 | 65.27 | 66.30 | 66.88 | 98.46 | 72.09 |
| 14 | Baichuan-Omni-1.5 | Omni | Open | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| 15 | MiniCPM-o | Omni | Open | 4.42 | 4.15 | 3.94 | 50.72 | 54.78 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| 16 | Whisper-v3-turbo + LLaMA-3.2-3B | Cascaded | Open | 4.45 | 3.82 | 4.04 | 49.28 | 51.37 | 60.66 | 63.90 | 69.71 | 98.08 | 71.02 |
| 17 | Baichuan-Audio | Omni | Open | 4.41 | 4.08 | 3.92 | 45.84 | 53.19 | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| 18 | MERaLiON | Audio-LLM | Open | 4.50 | 3.77 | 4.12 | 55.06 | 34.95 | 27.23 | 62.60 | 62.93 | 94.81 | 65.04 |
| 19 | VITA-1.5 | Omni | Open | 4.21 | 3.66 | 3.48 | 38.88 | 52.15 | 71.65 | 55.30 | 38.14 | 97.69 | 64.53 |
| 20 | Phi-4-multimodal | Audio-LLM | Open | 3.81 | 3.82 | 3.56 | 39.78 | 42.19 | 65.93 | 61.80 | 45.35 | 100.00 | 64.32 |
| 21 | Ola | Omni | Open | 4.12 | 2.97 | 3.19 | 33.82 | 45.97 | 67.91 | 51.10 | 39.57 | 90.77 | 59.42 |
| 22 | Lyra-Base | Omni | Open | 3.85 | 3.50 | 3.42 | 38.25 | 49.74 | 72.75 | 59.00 | 36.28 | 59.62 | 59.00 |
| 23 | Ultravox-v0.5-LLaMA-3.2-1B | Audio-LLM | Open | 4.04 | 3.57 | 3.47 | 34.72 | 30.03 | 35.60 | 52.70 | 45.56 | 96.92 | 57.46 |
| 24 | DiVA | Audio-LLM | Open | 3.67 | 3.54 | 3.74 | 57.05 | 25.76 | 25.49 | 51.80 | 39.15 | 98.27 | 57.39 |
| 25 | GLM-4-Voice | Omni | Open | 3.97 | 3.42 | 3.18 | 36.98 | 39.75 | 53.41 | 52.80 | 25.92 | 88.08 | 56.48 |
| 26 | Qwen2-Audio | Audio-LLM | Open | 3.74 | 3.43 | 3.01 | 35.71 | 35.72 | 49.45 | 54.70 | 26.33 | 96.73 | 55.80 |
| 27 | Freeze-Omni | S2S / Full-Duplex | Open | 4.03 | 3.46 | 3.15 | 53.45 | 28.14 | 30.98 | 50.70 | 23.40 | 97.30 | 55.20 |
| 28 | Step-Audio | Omni | Open | 4.13 | 3.09 | 2.93 | 44.21 | 28.33 | 33.85 | 50.60 | 27.96 | 69.62 | 50.84 |
| 29 | Megrez-3B-Omni | Omni | Open | 3.50 | 2.95 | 2.34 | 25.95 | 27.03 | 28.35 | 50.30 | 25.71 | 87.69 | 46.76 |
| 30 | Ichigo | Omni | Open | 3.79 | 3.17 | 2.83 | 36.53 | 25.63 | 26.59 | 46.50 | 21.59 | 57.50 | 45.57 |
| 31 | Lyra-Mini | Omni | Open | 2.99 | 2.69 | 2.58 | 19.89 | 31.42 | 41.54 | 48.40 | 20.91 | 80.00 | 45.26 |
| 32 | Mair-hub-0.5B-Omni | Omni | Open | 3.06 | 2.87 | 2.48 | 21.70 | 25.60 | 25.27 | 50.90 | 14.85 | 94.81 | 44.59 |
| 33 | LLaMA-Omni | Omni | Open | 3.70 | 3.46 | 2.92 | 39.69 | 25.93 | 27.47 | 49.20 | 14.87 | 11.35 | 41.12 |
| 34 | VITA-1.0 | Omni | Open | 3.38 | 2.15 | 1.87 | 27.94 | 25.70 | 29.01 | 47.70 | 22.82 | 26.73 | 36.43 |
| 35 | SLAM-Omni | Omni | Open | 1.90 | 1.79 | 1.60 | 4.16 | 26.06 | 25.27 | 48.80 | 13.38 | 94.23 | 35.30 |
| 36 | Mini-Omni2 | Omni | Open | 2.32 | 2.18 | 1.79 | 9.31 | 24.27 | 26.59 | 46.40 | 11.56 | 57.50 | 33.49 |
| 37 | Mini-Omni | Omni | Open | 1.95 | 2.02 | 1.61 | 13.92 | 24.69 | 26.59 | 46.30 | 13.58 | 37.12 | 30.42 |
| 38 | Moshi | S2S / Full-Duplex | Open | 2.01 | 1.60 | 1.30 | 15.64 | 24.04 | 25.93 | 47.40 | 10.12 | 44.23 | 29.51 |
Scores from VoiceBench. Architecture types are community-classified. Submit new results via the issue tracker.
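The Overall column is consistent with a simple unweighted mean of the nine benchmark scores, after rescaling the three metrics reported on a 5-point scale (AlpacaEval, CommonEval, WildVoice) to 0–100 by multiplying by 20. A minimal sketch of that aggregation, under the assumption that no per-benchmark weighting is applied:

```python
def overall_score(judge_scores, accuracy_scores):
    """Unweighted mean of all nine benchmarks.

    judge_scores:    the three 5-point metrics (AlpacaEval, CommonEval,
                     WildVoice), rescaled to 0-100 by multiplying by 20.
    accuracy_scores: the six metrics already reported as percentages
                     (SD-QA, MMSU, OBQA, BBH, IFEval, AdvBench).
    """
    rescaled = [s * 20 for s in judge_scores]
    all_scores = rescaled + list(accuracy_scores)
    return round(sum(all_scores) / len(all_scores), 2)

# Ultravox-GLM-4P7 (table row 1):
print(overall_score([4.87, 4.30, 4.55],
                    [84.15, 83.82, 94.72, 87.16, 76.26, 99.23]))  # → 88.86
```

The same formula reproduces the other rows as well, e.g. Moshi's scores yield 29.51.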

About VoiceBench

A comprehensive benchmark for LLM-based voice assistants across 9 evaluation datasets, covering diverse speaker characteristics, environmental factors, and content variations. To submit your model, open an issue on GitHub.

Architecture types: Cascaded = separate ASR + LLM pipeline · Audio-LLM = audio encoder fused with LLM, text output only · Omni = end-to-end speech-in speech-out, turn-based · S2S / Full-Duplex = simultaneous listen & speak, no external VAD
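The Cascaded pattern above can be sketched as two independently swappable stages chained through text. This is a minimal illustration with hypothetical stub components (EchoASR, ShoutLLM) standing in for a real ASR model and LLM; the point is the interface, not the models:

```python
from typing import Protocol


class ASR(Protocol):
    """Speech-to-text stage (e.g. a Whisper model)."""
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    """Text-only language model stage."""
    def complete(self, prompt: str) -> str: ...


class CascadedAssistant:
    """Cascaded pipeline: audio -> transcript -> text reply.

    The LLM only ever sees the transcript, so paralinguistic cues
    (tone, emotion, speaker identity) are lost at the ASR boundary,
    which is the usual trade-off versus Audio-LLM and Omni systems.
    """
    def __init__(self, asr: ASR, llm: LLM) -> None:
        self.asr = asr
        self.llm = llm

    def respond(self, audio: bytes) -> str:
        transcript = self.asr.transcribe(audio)
        return self.llm.complete(transcript)


# Toy stubs (hypothetical) just to show the control flow:
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the bytes are the transcript


class ShoutLLM:
    def complete(self, prompt: str) -> str:
        return prompt.upper()


assistant = CascadedAssistant(EchoASR(), ShoutLLM())
print(assistant.respond(b"what is voicebench"))  # → WHAT IS VOICEBENCH
```

Because the stages only share a string, either one can be replaced without touching the other, which is why the leaderboard lists cascaded entries as ASR + LLM pairs.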

Citation

If you use VoiceBench in your research, please cite the following paper:

@article{chen2024voicebench,
  title={VoiceBench: Benchmarking LLM-Based Voice Assistants},
  author={Chen, Yiming and Yue, Xianghu and Zhang, Chen and Gao, Xiaoxue and Tan, Robby T. and Li, Haizhou},
  journal={arXiv preprint arXiv:2410.17196},
  year={2024}
}