Audio Entailment: Assessing Deductive Reasoning for Audio Understanding

Soham Deshmukh, Shuo Han, Hazim Bukhari, Benjamin Elizalde, Hannes Gamper, Rita Singh, Bhiksha Raj

TL;DR

This work defines Audio Entailment, a task that evaluates deductive reasoning in Audio-Language Models (ALMs) by pairing in-the-wild audio premises with hypotheses generated by large language models. It introduces two datasets, ACE and CLE, built from AudioCaps and Clotho with human-verified, LLM-generated hypotheses, in which each audio-hypothesis pair is labeled entailment, neutral, or contradiction. Zero-shot and linear-probe experiments across contrastive and next-token ALMs uncover significant gaps in audio-grounded reasoning and instruction-following, while showing that simple prompts and representation-learning improvements can help. A key contribution is the caption-before-reason approach, which yields an absolute improvement of about 6 percentage points in zero-shot F1 and 3 points in linear-probe F1. This result highlights grounding as a critical factor for robust audio reasoning and offers a practical way to improve ALMs without fine-tuning.

Abstract

Recent literature uses language to build foundation models for audio. These Audio-Language Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval, Captioning, and Question Answering. However, their ability to engage in more complex open-ended tasks, like Interactive Question-Answering, requires proficiency in logical reasoning -- a skill not yet benchmarked. We introduce the novel task of Audio Entailment to evaluate an ALM's deductive reasoning ability. This task assesses whether a text description (hypothesis) of audio content can be deduced from an audio recording (premise), with potential conclusions being entailment, neutral, or contradiction, depending on the sufficiency of the evidence. We create two datasets for this task with audio recordings sourced from two audio captioning datasets -- AudioCaps and Clotho -- and hypotheses generated using Large Language Models (LLMs). We benchmark state-of-the-art ALMs and find deficiencies in logical reasoning with both zero-shot and linear probe evaluations. Finally, we propose "caption-before-reason", an intermediate step of captioning that improves the zero-shot and linear-probe performance of ALMs by an absolute 6% and 3%, respectively.
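To make the three-way label space concrete, here is a minimal sketch of how one Audio Entailment example and its scoring could be represented. The names `Label`, `EntailmentExample`, and `macro_f1` are hypothetical and not taken from the paper. Macro-averaging over the three classes is also an assumption, since the text above reports F1 without stating how it is averaged.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The three possible conclusions for an (audio, hypothesis) pair."""
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class EntailmentExample:
    audio_path: str   # premise: an audio recording (hypothetical field name)
    hypothesis: str   # textual hypothesis about the audio content
    label: Label      # gold conclusion


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = []
    for cls in Label:
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted instances scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A model that collapses the contradiction class into neutral is penalized heavily under this scheme, because every class contributes equally to the average regardless of its frequency.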

Paper Structure (22 sections, 5 figures, 14 tables)

Figures (5)

  • Figure 1: (Bottom left) Audio-Language Models have to infer Entailment, Neutral, or Contradiction from an audio premise $\mathcal{P}$ and a textual hypothesis $\mathcal{H}_*$. (Top) The highest performing Zero-Shot inference (or classification) is 57% F1 from LAION CLAP. (Bottom right) Our proposed method, combining MS CLAP 23 and a captioning step, enhances performance by an absolute 3% F1.
  • Figure 2: Two examples of the Audio Entailment task. Each example pairs an audio premise $\mathcal{P}$ with a textual hypothesis $\mathcal{H}_*$. The image and description are included for illustration only and are not part of the task. Given the premise, the correct label is Entailment for $\mathcal{H}_1$, Neutral for $\mathcal{H}_2$, and Contradiction for $\mathcal{H}_3$.
  • Figure 3: “Caption-before-reason”: An intermediate step of audio captioning enhances performance on the Audio Entailment task. The left figure illustrates a zero-shot setup in which the ALM is first asked to caption the audio before reasoning about the hypothesis. The right figure depicts a linear-probe setup, in which a caption and its embedding are generated before being passed to a classifier for prediction.
  • Figure 4: Top audio events present in the generated hypotheses for the Clotho and Audio Entailment dataset.
  • Figure 5: Comparison of zero-shot prompting and “caption-before-reason” responses. The Audio-Language Model (ALM) used is Qwen-AC. The left pane displays the input, where audio and a hypothesis are provided to the ALM. The caption beside the audio is for reference and illustration only. The second pane shows Qwen-AC's responses using zero-shot prompting. The third pane presents Qwen-AC's responses using the “caption-before-reason” method. Both methods involve zero-shot prompting and do not require model training or fine-tuning. Overall, our method enhances the model's ability to identify contradictions by providing explicit captions before reasoning. With plain zero-shot prompting, the model often agreed with the hypothesis; with caption-before-reason, it better discerns discrepancies between the hypothesis and the audio. This technique helps the model avoid hallucinating sound sources based on the hypothesis and ensures better grounding in the audio input.
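
The zero-shot comparison described in Figures 3 and 5 can be sketched as two prompting strategies over the same model. This is an illustrative sketch only: the `AudioLM` callable, the prompt wording, and the `parse_label` fallback are assumptions and do not reproduce the paper's actual prompts or the Qwen-AC interface.

```python
from typing import Callable

LABELS = ("entailment", "neutral", "contradiction")

# Hypothetical interface: (audio_path, text_prompt) -> text response.
AudioLM = Callable[[str, str], str]


def parse_label(response: str) -> str:
    """Return the earliest label named in the response, else 'neutral'."""
    text = response.lower()
    hits = [(text.find(label), label) for label in LABELS if label in text]
    return min(hits)[1] if hits else "neutral"


def direct_zero_shot(alm: AudioLM, audio_path: str, hypothesis: str) -> str:
    """Ask for the label directly from the audio and the hypothesis."""
    prompt = (
        f'Hypothesis: "{hypothesis}". Based on the audio, answer with one '
        "word: entailment, neutral, or contradiction."
    )
    return parse_label(alm(audio_path, prompt))


def caption_before_reason(alm: AudioLM, audio_path: str, hypothesis: str) -> str:
    """Caption the audio first, then reason with the caption made explicit."""
    caption = alm(audio_path, "Describe the audio in one sentence.")
    prompt = (
        f'Audio caption: "{caption}". Hypothesis: "{hypothesis}". '
        "Based on the audio and its caption, answer with one word: "
        "entailment, neutral, or contradiction."
    )
    return parse_label(alm(audio_path, prompt))
```

Both functions call the model without any training. The only difference is that the second places a model-generated caption in the prompt before the hypothesis is judged, which is the grounding step the figure captions describe.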