PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

Yuanjian Chen; Yang Xiao; Han Yin; Xubo Liu; Jinjie Huang; Ting Dang

TL;DR

State-of-the-art LALMs show consistent performance degradation on polyphonic audio, which the authors identify as a fundamental bottleneck in current models.

Abstract

Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio. However, existing benchmarks provide limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. In this work, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio. PolyBench comprises five evaluation subsets covering counting, classification, detection, concurrency, and duration estimation, requiring reasoning over multiple concurrent events and their relations. Evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current LALMs.

Paper Structure (14 sections, 2 figures, 2 tables)

Introduction
Benchmark
Stage 1: Problem Observation
Stage 2: Data and Questions Preparation
MCQA Tasks in PolyBench
Audio Sources and Statistics
Stage 3: Metric and Evaluation
Experimental Setups
Result and Analysis
Bottleneck in Polyphonic Scenarios
Robustness Divergence in Compositional Reasoning
Performance Illustration and Shortcut Learning in Concurrency
Conclusion
Generative AI Use Disclosure

Figures (2)

Figure 1: PolyBench pipeline for benchmark construction and evaluation. The process includes (1) observing polyphony-induced failure patterns by contrasting monophonic and polyphonic audio; (2) curating real-world polyphonic clips and generating MCQAs for five task types via human–LLM collaboration with iterative quality control; and (3) evaluating LALMs by scoring their MCQA predictions against golden answers. The right panel shows representative examples of the five tasks (Counting, Duration, Concurrency, Classification, and Detection).
Figure 2: Visual analysis of PolyBench dataset characteristics. Left: distribution of PolyBench event tags across top-level AudioSet ontology categories—CE (Channel, environment and background), MU (Music), NS (Natural sounds), AS (Animal sounds), HS (Human sounds), and ST (Sounds of things). Numbers at the end of each bar indicate the event occurrence counts in the dataset, while the x-axis shows the proportion of tags within each source dataset. Right: proportion of audio clips by the number of overlapping sound sources (e.g., 2, 3, and 4 concurrent events).
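
Step (3) of the pipeline in Figure 1 scores each model's MCQA predictions against golden answers. A minimal sketch of what such scoring might look like is given below, assuming each item is a record with a task name, a predicted option letter, and a golden option letter. The record fields, the function name `score_mcqa`, and the toy records are all hypothetical illustrations; they are not taken from the PolyBench release, and the paper's actual metric may differ.

```python
from collections import defaultdict


def score_mcqa(records):
    """Compute per-task and overall accuracy for multiple-choice answers.

    Each record is a dict with hypothetical keys:
      "task"       - task name, e.g. "counting" or "detection"
      "prediction" - option letter chosen by the model, e.g. "B"
      "answer"     - golden option letter, e.g. "B"
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        task = rec["task"]
        total[task] += 1
        # Compare option letters, ignoring case and surrounding whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[task] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_task, overall


# Toy records, made up for illustration only (not benchmark data).
example = [
    {"task": "counting", "prediction": "A", "answer": "A"},
    {"task": "counting", "prediction": "c", "answer": "B"},
    {"task": "detection", "prediction": " d ", "answer": "D"},
]

per_task, overall = score_mcqa(example)
print(per_task, overall)
```

Reporting accuracy per task as well as overall keeps a strong result on one subset from masking a weak result on another, which matters when the five subsets probe different skills.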
