
ADIFF: Explaining audio difference using natural language

Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj

TL;DR

This paper tackles the task of explaining differences between audio recordings in natural language, with applications in forensics, quality assessment, and audio generation. It introduces two datasets, AudioCaps Difference (ACD) and Clotho Difference (CLD), and generates three tiers of explanations via LLM prompting. It then proposes ADIFF, a prefix-tuning-based model with a cross-projection module and a three-stage training process, to produce detailed, human-like explanations. Across objective metrics and human judgments, ADIFF outperforms a naive baseline and the SoTA Audio-Language Model (ALM) Qwen Audio, and ablation studies illuminate the contributions of cross-projection, audio grounding, and staged training. The benchmarks and findings pave the way for nuanced explanations of audio differences with potential impact on forensic analysis and realistic audio synthesis.

Abstract

Understanding and explaining differences between audio recordings is crucial for fields like audio forensics, quality assessment, and audio generation. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners. This paper is the first work to comprehensively study the task of explaining audio differences and to propose a benchmark and baselines for the task. First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. Using Large Language Models (LLMs), we generate three levels of difference explanations: (1) concise descriptions of audio events and objects, (2) brief sentences about audio events, acoustic scenes, and signal properties, and (3) comprehensive explanations that include semantics and listener emotions. For the baseline, we use prefix tuning, where audio embeddings from two audio files are used to prompt a frozen language model. Our empirical analysis and ablation studies reveal that the naive baseline struggles to distinguish perceptually similar sounds and to generate detailed tier 3 explanations. To address these limitations, we propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations. We evaluate our model using objective metrics and human evaluation and show that our model enhancements lead to significant improvements in performance over the naive baseline and the SoTA Audio-Language Model (ALM) Qwen Audio. Lastly, we conduct multiple ablation studies on the effects of cross-projection, language model parameters, position captioning, and third-stage fine-tuning, and present our findings. Our benchmarks, findings, and strong baseline pave the way for nuanced and human-like explanations of audio differences.

Paper Structure

This paper contains 41 sections, 4 equations, 11 figures, 25 tables.

Figures (11)

  • Figure 1: Humans use auditory information to compare scenes and make deductions.
  • Figure 2: A random sample from the ACD dataset is displayed across three levels of explanation. The top pane provides a concise explanation, the middle pane offers a brief explanation, and the bottom pane presents a detailed explanation.
  • Figure 3: ADIFF takes two audio recordings and a text prompt as input and generates free-form text as output. The two audio recordings and the prompt are encoded independently by the audio encoder and the text embedder, respectively, and projection layers then map the embeddings into the latent space of the transformer decoder. The two audio latents are separated by a separator (SEP) token in latent space. The prefix formed by audio latent 1, SEP, audio latent 2, and the text prompt prefix is fed to the cross-projection layer. The output of the cross-projection layer is used to prompt the transformer decoder to generate natural language explanations.
  • Figure 4: Change in average score across tiers with increase in LM parameters.
  • Figure 5: Audio event presence probabilities from ADIFF to detect hallucinations.
  • ...and 6 more figures
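
The prefix layout described in the Figure 3 caption (audio latent 1, SEP, audio latent 2, text prompt prefix) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the dimensions, the random weights, the shared audio projection, and the names `linear`, `random_matrix`, and `build_prefix` are all assumptions made for illustration, and the cross-projection layer and transformer decoder that consume the prefix are left out.

```python
import random

# Hypothetical sizes; the real encoder and decoder widths are not given here.
D_AUDIO, D_TEXT, D_MODEL = 8, 6, 4


def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights (stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vectors, weight):
    """Project each row vector with a (d_in x d_out) weight matrix."""
    d_out = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]


def build_prefix(audio1, audio2, prompt, w_audio, w_text, sep):
    """Assemble [audio latent 1, SEP, audio latent 2, prompt latent] as one sequence."""
    latent1 = linear(audio1, w_audio)       # first recording -> decoder latent space
    latent2 = linear(audio2, w_audio)       # assumption: both recordings share one projection
    prompt_latent = linear(prompt, w_text)  # text prompt -> decoder latent space
    return latent1 + [sep] + latent2 + prompt_latent


rng = random.Random(0)
w_audio = random_matrix(D_AUDIO, D_MODEL, rng)
w_text = random_matrix(D_TEXT, D_MODEL, rng)
sep = [0.0] * D_MODEL  # stand-in for a learned separator embedding

audio1 = random_matrix(3, D_AUDIO, rng)  # 3 frames of encoder output for recording 1
audio2 = random_matrix(3, D_AUDIO, rng)  # 3 frames of encoder output for recording 2
prompt = random_matrix(2, D_TEXT, rng)   # 2 embedded prompt tokens

prefix = build_prefix(audio1, audio2, prompt, w_audio, w_text, sep)
print(len(prefix), len(prefix[0]))
```

Keeping both recordings in a single sequence with an explicit separator means a downstream layer can attend across the two audio segments while still knowing where one ends and the other begins.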