Investigating Faithfulness in Large Audio Language Models

Lovenya Jain, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan

TL;DR

The paper asks whether chain-of-thought (CoT) explanations in Large Audio-Language Models (LALMs) faithfully reflect the models' reasoning, which matters for safety-critical use. It introduces a systematic intervention framework (filler tokens, paraphrasing, early answering, and introducing mistakes) and applies it to two LALMs, Qwen2-Audio-7B-Instruct and SALMONN-13B, on the SAKURA and MMAR datasets to probe the faithfulness of their CoTs. The findings indicate that CoTs in LALMs largely align with the models' decision processes, with sensitivity varying across interventions and a notable dependence on the presence of the full reasoning chain. The work provides a practical methodology for assessing faithfulness in multimodal reasoning and motivates broader, cross-model analyses to improve reliability and interpretability in audio-augmented AI systems.

Abstract

Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. Across these interventions, datasets, and tasks, our experiments suggest that LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.
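To make the intervention idea concrete, the sketch below computes how often a model's answer stays the same after its CoT is modified. This is one common way to read such experiments and not necessarily the metric the paper reports. The names `answer_consistency`, `intervene`, and `answer_with_cot` are hypothetical placeholders, and the "model" in the usage lines is a toy stand-in, not an LALM.

```python
from typing import Callable, Sequence


def answer_consistency(
    cots: Sequence[str],
    original_answers: Sequence[str],
    intervene: Callable[[str], str],
    answer_with_cot: Callable[[str], str],
) -> float:
    """Fraction of examples whose answer is unchanged after the CoT is modified.

    `intervene` maps a CoT to its modified version (assumption: a pure
    text-to-text function). `answer_with_cot` is a hypothetical wrapper that
    asks the model for a final answer given a CoT.
    """
    if len(cots) != len(original_answers):
        raise ValueError("cots and original_answers must have the same length")
    if not cots:
        raise ValueError("need at least one example")
    unchanged = sum(
        answer_with_cot(intervene(cot)) == answer
        for cot, answer in zip(cots, original_answers)
    )
    return unchanged / len(cots)


# Toy stand-in for a model: it answers "dog" only if the CoT mentions barking.
def toy_model(cot: str) -> str:
    return "dog" if "barking" in cot else "unknown"


cots = ["I hear barking.", "There is a meow."]
answers = [toy_model(c) for c in cots]
score = answer_consistency(
    cots, answers, lambda c: c.replace("barking", "..."), toy_model
)
```

Under this reading, a low score after a destructive intervention (masking, truncation, injected mistakes) suggests the answer depends on the CoT content, while a high score after paraphrasing suggests the dependence is on meaning and not on exact wording.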

Paper Structure

This paper contains 10 sections, 5 figures.

Figures (5)

  • Figure 1: (left) Filler token modification of CoT representation. We randomly mask a certain percentage of the CoT to see the effect on the answer that the model gives. (middle-left) Paraphrasing of CoT representation. We use an LLM to paraphrase the CoT. (middle-right) Early answering modification of CoT. We remove the last sentences depending on the rate of early answering. (right) Adding mistakes to the CoT. We add mistakes to the CoT with a certain rate.
  • Figure 2: Injecting filler tokens inside CoTs (left) for QWEN, (right) for SALMONN.
  • Figure 3: Paraphrasing of CoTs (left) for QWEN, (right) for SALMONN.
  • Figure 4: Early Answering modification on CoTs (left) for QWEN, (right) for SALMONN.
  • Figure 5: Adding Mistakes modification on CoTs (left) for QWEN, (right) for SALMONN.
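Two of the interventions described in the Figure 1 caption, filler-token masking and early answering, are simple text transformations and can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: masking works on whitespace-separated words, `"..."` is an assumed filler token, sentences are split with a simple regular expression, and the function names are hypothetical. Paraphrasing and mistake injection are omitted because they require an external LLM.

```python
import random
import re


def inject_filler_tokens(cot: str, rate: float, filler: str = "...", seed: int = 0) -> str:
    """Replace a randomly chosen fraction `rate` of the words in `cot` with `filler`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    words = cot.split()
    n_mask = round(rate * len(words))
    rng = random.Random(seed)  # seeded so the masking is reproducible
    masked = set(rng.sample(range(len(words)), n_mask))
    return " ".join(filler if i in masked else w for i, w in enumerate(words))


def truncate_for_early_answering(cot: str, rate: float) -> str:
    """Drop the last fraction `rate` of sentences so the model must answer early."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cot.strip()) if s]
    n_drop = round(rate * len(sentences))
    return " ".join(sentences[: len(sentences) - n_drop])


cot = "The clip has barking. Barking comes from dogs. So the animal is a dog."
masked_cot = inject_filler_tokens(cot, rate=0.5)
short_cot = truncate_for_early_answering(cot, rate=1 / 3)
```

Sweeping `rate` from 0 to 1 and re-querying the model at each step would give curves of the kind shown per model in Figures 2 and 4.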