PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio

Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang

TL;DR

According to this work, state-of-the-art LALMs show consistent performance degradation on polyphonic audio, which the authors interpret as a fundamental bottleneck in current LALMs.

Abstract

Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio. However, existing benchmarks provide limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. In this work, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio. PolyBench comprises five evaluation subsets covering counting, classification, detection, concurrency, and duration estimation, requiring reasoning over multiple concurrent events and their relations. Evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current LALMs.

Paper Structure

This paper contains 14 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: PolyBench pipeline for benchmark construction and evaluation. The process includes (1) observing polyphony-induced failure patterns by contrasting monophonic and polyphonic audio; (2) curating real-world polyphonic clips and generating MCQAs for five task types via human–LLM collaboration with iterative quality control; and (3) evaluating LALMs by scoring their MCQA predictions against golden answers. The right panel shows representative examples of the five tasks (Counting, Duration, Concurrency, Classification, and Detection).
  • Figure 2: Visual analysis of PolyBench dataset characteristics. Left: distribution of PolyBench event tags across top-level AudioSet ontology categories—CE (Channel, environment and background), MU (Music), NS (Natural sounds), AS (Animal sounds), HS (Human sounds), and ST (Sounds of things). Numbers at the end of each bar indicate the event occurrence counts in the dataset, while the x-axis shows the proportion of tags within each source dataset. Right: proportion of audio clips by the number of overlapping sound sources (e.g., 2, 3, and 4 concurrent events).
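
Step (3) of the Figure 1 caption, scoring MCQA predictions against golden answers, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function name `score_mcqa`, the question IDs, the dictionary layout, and the letter-based answer format are all assumptions made for illustration, and exact match after case and whitespace normalization is only one possible scoring rule.

```python
from collections import defaultdict


def score_mcqa(predictions, golden):
    """Return (per-task accuracy, overall accuracy) for multiple-choice answers.

    predictions: dict mapping question id -> predicted option letter (hypothetical format)
    golden:      dict mapping question id -> (task name, correct option letter)
    A missing prediction counts as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qid, (task, answer) in golden.items():
        total[task] += 1
        predicted = predictions.get(qid, "")
        # Exact match after trimming whitespace and ignoring case (an assumption).
        if predicted.strip().upper() == answer.strip().upper():
            correct[task] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    n_questions = sum(total.values())
    overall = sum(correct.values()) / n_questions if n_questions else 0.0
    return per_task, overall


if __name__ == "__main__":
    # Toy data for illustration only; these are not PolyBench items or results.
    golden = {
        "q1": ("Counting", "B"),
        "q2": ("Counting", "A"),
        "q3": ("Duration", "C"),
    }
    predictions = {"q1": "b ", "q2": "C", "q3": "C"}
    per_task, overall = score_mcqa(predictions, golden)
    print(per_task, overall)
```

Reporting accuracy per task as well as overall keeps the five task types (Counting, Duration, Concurrency, Classification, and Detection) separately visible, so a weakness on one task is not hidden by the average.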