Table of Contents
Fetching ...

AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound

Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier

TL;DR

AudSemThinker advances audio-language understanding by embedding explicit reasoning over fine-grained auditory semantics through a thinking phase and semantic descriptor analysis. It introduces AudSem, a diverse, low-overlap dataset curated from YouTube captions to mitigate data contamination, and demonstrates two training paradigms—Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO)—across MMAU and AudioBench benchmarks. Empirical results show strong performance, particularly in music-related tasks, with ablations underscoring the value of semantic descriptors and controlled thinking budgets. The work highlights how structured reasoning and careful dataset design can enhance generalization in audio-language models and provides publicly released resources for the community.

Abstract

Audio-language models have shown promising results in various sound understanding tasks, yet they remain limited in their ability to reason over the fine-grained semantics of sound. In this paper, we present AudSemThinker, a model whose reasoning is structured around a framework of auditory semantics inspired by human cognition. To support this, we introduce AudSem, a novel dataset specifically curated for semantic descriptor reasoning in audio-language models. AudSem addresses the persistent challenge of data contamination in zero-shot evaluations by providing a carefully filtered collection of audio samples paired with captions generated through a robust multi-stage pipeline. Our experiments demonstrate that AudSemThinker outperforms state-of-the-art models across multiple training settings, highlighting its strength in semantic audio reasoning. Both AudSemThinker and the AudSem dataset are released publicly.

AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound

TL;DR

AudSemThinker advances audio-language understanding by embedding explicit reasoning over fine-grained auditory semantics through a thinking phase and semantic descriptor analysis. It introduces AudSem, a diverse, low-overlap dataset curated from YouTube captions to mitigate data contamination, and demonstrates two training paradigms—Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO)—across MMAU and AudioBench benchmarks. Empirical results show strong performance, particularly in music-related tasks, with ablations underscoring the value of semantic descriptors and controlled thinking budgets. The work highlights how structured reasoning and careful dataset design can enhance generalization in audio-language models and provides publicly released resources for the community.

Abstract

Audio-language models have shown promising results in various sound understanding tasks, yet they remain limited in their ability to reason over the fine-grained semantics of sound. In this paper, we present AudSemThinker, a model whose reasoning is structured around a framework of auditory semantics inspired by human cognition. To support this, we introduce AudSem, a novel dataset specifically curated for semantic descriptor reasoning in audio-language models. AudSem addresses the persistent challenge of data contamination in zero-shot evaluations by providing a carefully filtered collection of audio samples paired with captions generated through a robust multi-stage pipeline. Our experiments demonstrate that AudSemThinker outperforms state-of-the-art models across multiple training settings, highlighting its strength in semantic audio reasoning. Both AudSemThinker and the AudSem dataset are released publicly.

Paper Structure

This paper contains 28 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Schematic overview of the AudSemThinker model and the AudSem dataset. Left: The AudSemThinker model's reasoning process, involving a thinking phase, semantic descriptor analysis, and answer generation. Right: Example of four types of tasks in the AudSem dataset: open-ended question answering, multiple-choice question answering, audio captioning, and creative writing.
  • Figure 2: Pipeline visualization of the creation of AudSem. Models with blue background are generative language models, while models with green background are classification models that predict outputs from a fixed set of possible labels.
  • Figure 3: Visual analysis of AudSem dataset characteristics, showing distributions of sound categories (left) and PCA projection of caption embeddings (right). The filtered dataset was created by filtering out audio caption embeddings with less than 0.5 similarity with the closed captions.
  • Figure 4: Analysis of thinking budget impact on model performance. Left: How applying a length constraint to the model's thinking nudges the mean length of the output. Right: The relationship between various length constraints and its length constraint reward.
  • Figure 5: Example of the audio-language model's output (two-phase, without semantic descriptors). The output contains a <thinking> section with detailed reasoning about the audio content, followed by an <answer> section with the concise caption.
  • ...and 1 more figures