Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding

Shangda Wu, Ziya Zhou, Yongyi Zang, Yutong Zheng, Dafang Liang, Ruibin Yuan, Qiuqiang Kong

TL;DR

Voices of Civilizations is the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. It shows that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context.

Abstract

We introduce Voices of Civilizations, the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. Covering 380 tracks across 38 languages, our automated pipeline yields 1,190 multiple-choice questions through four stages, each followed by manual verification: 1) compiling a representative music list; 2) generating cultural-background documents for each sample in the music list via LLMs; 3) extracting key attributes from those documents; and 4) constructing multiple-choice questions probing language, region associations, mood, and thematic content. We evaluate models under four conditions and report per-language accuracy. Our findings demonstrate that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions. The dataset is publicly available on Hugging Face to foster culturally inclusive music understanding research.
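
The per-language accuracy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record fields `language`, `answer`, and `predicted` are hypothetical names chosen for illustration and are not taken from the released dataset, and the toy records are invented.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy in percent} over answered multiple-choice questions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["language"]] += 1
        # A question counts as correct when the chosen option matches the key.
        if r["predicted"] == r["answer"]:
            correct[r["language"]] += 1
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}


# Toy records, invented for illustration only (not benchmark data).
# The field names are assumptions, not the dataset's actual schema.
records = [
    {"language": "Arabic", "answer": "B", "predicted": "B"},
    {"language": "Arabic", "answer": "A", "predicted": "C"},
    {"language": "Korean", "answer": "D", "predicted": "D"},
]

print(per_language_accuracy(records))
```

Grouping by language before averaging keeps languages with few questions from being hidden inside a single pooled score, which is presumably why the benchmark reports accuracy per language.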

Paper Structure

This paper contains 4 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Example questions from the Voices of Civilizations benchmark on three folk songs: Arabic "Jafra," Chinese "Liuyang River," and Korean "Arirang."
  • Figure 2: Per-language accuracy (%) of three state-of-the-art audio LLMs on the VoC benchmark using audio input only and focusing on region, mood, and theme questions. We invited a Chinese music teacher to answer 29 questions across 10 Chinese songs in a strictly closed-book setting (no reference or lookup allowed), achieving an accuracy of 79.31%.
  • Figure :