How Contrastive Decoding Enhances Large Audio Language Models?

Tzu-Quan Lin; Wei-Ping Huang; Yi-Cheng Lin; Hung-yi Lee

TL;DR

This analysis demonstrates that Contrastive Decoding (CD) reliably rectifies errors in which Large Audio Language Models (LALMs) falsely claim an absence of audio or resort to uncertainty-driven guessing. It also provides a clear guideline for determining which LALM architectures are most suitable for CD enhancement based on their baseline error profiles.

Abstract

While Contrastive Decoding (CD) has proven effective at enhancing Large Audio Language Models (LALMs), the underlying mechanisms driving its success and the comparative efficacy of different strategies remain unclear. This study systematically evaluates four distinct CD strategies across diverse LALM architectures. We identify Audio-Aware Decoding and Audio Contrastive Decoding as the most effective methods. However, their impact varies significantly by model. To explain this variability, we introduce a Transition Matrix framework to map error pattern shifts during inference. Our analysis demonstrates that CD reliably rectifies errors in which models falsely claim an absence of audio or resort to uncertainty-driven guessing. Conversely, it fails to correct flawed reasoning or confident misassertions. Ultimately, these findings provide a clear guideline for determining which LALM architectures are most suitable for CD enhancement based on their baseline error profiles.
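As a rough illustration of the general idea behind contrastive decoding, the sketch below combines "expert" logits (the model conditioned on the audio) with "amateur" logits (a degraded or audio-free pass) before picking the next token. The combination rule `(1 + alpha) * expert - alpha * amateur` is one commonly used formulation in the contrastive decoding literature and is an assumption here; this page does not specify the exact rules used by AAD, ACD, or the other strategies. The function names and the `alpha` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(expert, amateur, alpha=1.0):
    """Combine expert and amateur logits for one decoding step.

    Assumed rule (not taken from the paper): (1 + alpha) * expert - alpha * amateur.
    With alpha = 0 this reduces to ordinary decoding from the expert logits.
    """
    if len(expert) != len(amateur):
        raise ValueError("expert and amateur logits must have the same length")
    return [(1.0 + alpha) * z - alpha * z_hat for z, z_hat in zip(expert, amateur)]


def greedy_pick(expert, amateur, alpha=1.0):
    """Return the index of the highest-scoring token after contrasting."""
    scores = contrastive_logits(expert, amateur, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens. Token 0 is favored with or without audio
# (a text prior); token 1 is mostly supported by the audio-conditioned pass.
expert_step = [2.0, 1.9, 0.1]
amateur_step = [2.0, 0.5, 0.1]
plain_choice = greedy_pick(expert_step, amateur_step, alpha=0.0)
contrast_choice = greedy_pick(expert_step, amateur_step, alpha=1.0)
```

In this toy step, plain decoding selects the token that the text prior already favors, while the contrasted scores shift toward the token whose support comes mainly from the audio. That is the intuition for why CD could help when a model ignores its audio input, though the real methods may differ in how the amateur pass is constructed.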

Paper Structure (16 sections, 7 equations, 2 figures, 1 table)

Introduction
Background
Large Audio Language Models
Contrastive Decoding Methods
Analysis Methods
Categorization of Response States
Automated Evaluation via LLM-as-a-Judge
The Transition Matrix Framework
Experimental Setup
Benchmarks
Implementation Details
Findings
Performance
Transition Matrix Analysis
Conclusion
...and 1 more section

Figures (2)

Figure 1: Overview of different contrastive decoding methods on LALM. In the diagrams, $z_t$ represents the expert logits, while $\hat{z}_t$ denotes the amateur logits.
Figure 2: Detailed comparison of transition matrices for different models, averaged across all tasks and the two most consistently effective methods (AAD and ACD). Complete transition matrices for each individual task are available in https://github.com/nervjack2/LALM-Contrastive-Decoding-Error-Profiles/tree/main/transition_matrix_examples.
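A transition matrix of the kind described for Figure 2 can be sketched as a row-normalized table of how each response moves from its baseline state to its state under contrastive decoding. The state labels below are hypothetical names derived from the error types mentioned in the abstract; the paper's own categorization, and whether it normalizes rows or reports raw counts, may differ.

```python
from collections import Counter

# Hypothetical response states, loosely following the abstract's error types.
STATES = [
    "correct",
    "no_audio_claim",
    "uncertain_guess",
    "flawed_reasoning",
    "confident_misassertion",
]


def transition_matrix(baseline, with_cd, states=STATES):
    """Row-normalized transitions from baseline states to states under CD.

    Entry [i][j] is the fraction of responses in state i at baseline that
    end up in state j with contrastive decoding. Rows with no samples are
    left as all zeros.
    """
    if len(baseline) != len(with_cd):
        raise ValueError("both label lists must cover the same examples")
    counts = Counter(zip(baseline, with_cd))
    matrix = []
    for src in states:
        row_total = sum(counts[(src, dst)] for dst in states)
        row = [
            counts[(src, dst)] / row_total if row_total else 0.0
            for dst in states
        ]
        matrix.append(row)
    return matrix


# Toy labels for four examples (made up purely for illustration).
baseline_labels = ["no_audio_claim", "no_audio_claim", "flawed_reasoning", "correct"]
cd_labels = ["correct", "no_audio_claim", "flawed_reasoning", "correct"]
matrix = transition_matrix(baseline_labels, cd_labels)
```

Reading a row of the result shows where the responses in one baseline state ended up: mass moving into the "correct" column indicates errors that contrastive decoding fixed, while mass staying on the diagonal indicates errors it left unchanged.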
