CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

TL;DR

This work identifies a gap in compositional reasoning for audio-language models (ALMs) trained with contrastive objectives. It introduces CompA, a suite of two benchmarks (CompA-order and CompA-attribute) that test order understanding and attribute-binding in ALMs, and shows that prior models perform only marginally better than random chance on these tasks. To address this, the authors propose CompA-CLAP, a two-stage fine-tuning of CLAP that combines composition-aware hard negatives with a modular contrastive loss, enabling fine-grained understanding of audio scenes at multiple granularities. The approach yields substantial improvements on the CompA benchmarks while maintaining strong performance on standard retrieval and zero-shot tasks, underscoring the importance of dedicated compositional training data and modular learning for ALMs. These results suggest practical paths toward audio-language systems that reason more reliably about composition and better reflect real-world relations between sound events.

Abstract

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audios have the same acoustic events but with different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audios. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
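The matching protocol described in the abstract (two audio-caption pairs per instance, with the model asked to match the right audio to the right caption) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the cosine similarity, the names `score_instance`, `text`, `audio`, and `group`, and the strict-inequality scoring rule are assumptions chosen for this example, and the embeddings are toy vectors standing in for ALM outputs.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_instance(a0: Vector, a1: Vector, c0: Vector, c1: Vector) -> Dict[str, bool]:
    """Score one instance made of pairs (a0, c0) and (a1, c1).

    s[i][j] is the similarity between audio i and caption j.
    Assumed scoring rule (hypothetical, for illustration):
      text  : each audio prefers its own caption over the other caption
      audio : each caption prefers its own audio over the other audio
      group : both of the above hold
    """
    s = [[cosine(a, c) for c in (c0, c1)] for a in (a0, a1)]
    text_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    audio_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    return {"text": text_ok, "audio": audio_ok, "group": text_ok and audio_ok}


# Toy embeddings: a model that separates the two compositions.
good = score_instance([1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9])

# Toy embeddings: a model that maps both captions to the same vector,
# as a bag-of-words-like text encoder might for reordered captions.
blind = score_instance([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0])
```

Using strict inequalities means a model that assigns identical similarities to both captions, as in the `blind` case, gets no credit, which is the failure mode the benchmark is designed to expose.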

Paper Structure (30 sections, 7 equations, 8 figures, 14 tables, 1 algorithm)

Figures (8)

  • Figure 1: Comparison of CLAP performance after shuffling the word order in captions. CLAP undergoes an average degradation of 0.04 in top-1 (R@1) and 0.03 in top-10 (R@10) retrieval.
  • Figure 2: CompA-order evaluates an ALM's capability to understand the order of occurrence between multiple acoustic events in an audio.
  • Figure 3: CompA-attribute evaluates an ALM's capability to understand attribute-binding for multiple acoustic events in an audio.
  • Figure 4: Comparison of unique acoustic events per audio between LAION Audio-630K and CompA-AudioSet.
  • Figure 5: Illustration of contrastive learning techniques for improving compositional reasoning in ALMs. Left: Contrastive training with compositionally-aware hard negatives, where each audio has $K$ hard negative captions generated using an LLM, and each audio in the batch ignores the negatives of other audios in the batch for more focused training. Right: Our proposed Modular Contrastive training employs multiple positives and negatives for each audio in the batch, generated using a template-based algorithm. Each positive describes compositional relationships of various granularities in the audio, which helps the model learn fine-grained order and attribute-binding. An audio in the batch ignores the positives and negatives of other audios.
  • ...and 3 more figures
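
The left panel of Figure 5 describes contrastive training in which each audio is contrasted against the captions of the batch plus its own $K$ hard negative captions, while ignoring the hard negatives that belong to other audios. A minimal sketch of the audio-to-text half of such a loss is given below. It is an illustration under stated assumptions, not the paper's implementation: the function name `hard_negative_a2t_loss`, the temperature value of 0.07, the use of plain dot products on pre-normalized embeddings, and the omission of the text-to-audio term are all choices made for this example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def hard_negative_a2t_loss(
    audio: List[Vector],            # B audio embeddings (assumed unit-norm)
    pos_text: List[Vector],         # B matching caption embeddings
    neg_text: List[List[Vector]],   # for each audio, its own K hard negatives
    temperature: float = 0.07,      # assumed value, not taken from the paper
) -> float:
    """Mean cross-entropy over the batch, audio-to-text direction only.

    For audio i the candidate set is all B batch captions plus the K hard
    negatives of audio i; hard negatives of other audios are never seen.
    """
    total = 0.0
    for i, a in enumerate(audio):
        logits = [dot(a, t) / temperature for t in pos_text]
        logits += [dot(a, n) / temperature for n in neg_text[i]]
        # Numerically stable log-sum-exp over the candidate set.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # caption i is the target for audio i
    return total / len(audio)


audio = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]

# Easy negatives: orthogonal to the audio they are attached to.
easy = hard_negative_a2t_loss(audio, captions, [[[0.0, 1.0]], [[1.0, 0.0]]])

# Hard negatives: indistinguishable from the matching caption.
hard = hard_negative_a2t_loss(audio, captions, [[[1.0, 0.0]], [[0.0, 1.0]]])
```

In this toy setting the loss is close to zero when the extra negatives are easy and close to $\log 2$ when each negative is embedded identically to the true caption, so the gradient signal concentrates on exactly the captions the text encoder fails to separate. The modular variant in the right panel would extend the candidate set with several positives per audio; that extension is not sketched here.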