Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering

Dinh Phu Tran; Jihoon Jeong; Saad Wazir; Seongah Kim; Thao Do; Cem Subakan; Daeyoung Kim

TL;DR

This paper reframes audio-visual question answering as a reliable task ($\mathcal{R}$-AVQA) where abstention is preferred over incorrect answers. It introduces Adaptive Confidence Refinement (ACR), a lightweight, input-adaptive confidence framework that keeps the strong MSP baseline as the primary signal and adds a Residual Risk Head and a Confidence Gating Head that refine it using multimodal features. The final confidence is $C_{\mathrm{ACR}}(\boldsymbol{x}) = \alpha(\boldsymbol{x})C_{\mathrm{M}}(\boldsymbol{x}) + (1-\alpha(\boldsymbol{x}))C_{\mathrm{R}}(\boldsymbol{x})$, where $C_{\mathrm{M}}$ is the MSP confidence, $C_{\mathrm{R}}$ is the confidence from the residual head, and $\alpha(\boldsymbol{x})$ is the gating weight. Through two-stage training, ACR learns to approximate the posterior probability of correctness $\mathbb{P}(c(\boldsymbol{x})=1 \mid \boldsymbol{x})$. Theoretical results show that an optimal input-dependent fusion weight $\bar{\alpha}^{*}$ exists under a cross-moment condition, yielding lower MSE and better ranking than MSP alone. Empirically, ACR yields state-of-the-art risk-coverage performance across three AVQA backbones (QA-TIGER, ST-AVQA, TSPM) and across in-distribution, out-of-distribution, and data-bias scenarios on MUSIC-AVQA variants, with the largest gains in low-risk settings and improved calibration (ECE). The work shows how abstention can be combined with confidence refinement to make AVQA systems more trustworthy in assistive and multimedia contexts.
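The fusion rule above can be illustrated with a minimal sketch. It assumes the three per-input quantities (MSP confidence, residual-head confidence, gate weight) are already available as arrays; the function names, toy values, and the threshold are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def acr_confidence(c_msp, c_res, alpha):
    """Convex combination alpha * C_M + (1 - alpha) * C_R, computed per input."""
    c_msp, c_res, alpha = (np.asarray(a, dtype=float) for a in (c_msp, c_res, alpha))
    return alpha * c_msp + (1.0 - alpha) * c_res

def answer_or_abstain(confidence, threshold):
    """True where the model answers, False where it abstains."""
    return np.asarray(confidence, dtype=float) >= threshold

# Toy values (hypothetical, for illustration only).
c_msp = np.array([0.95, 0.90, 0.60])   # MSP confidence C_M(x)
c_res = np.array([0.90, 0.30, 0.70])   # residual-head confidence C_R(x)
alpha = np.array([0.80, 0.20, 0.50])   # gate: how much to trust MSP

c_acr = acr_confidence(c_msp, c_res, alpha)          # approx. [0.94, 0.42, 0.65]
answered = answer_or_abstain(c_acr, threshold=0.75)  # [True, False, False]
```

In the second toy input the gate places little trust in a high MSP score, so the fused confidence drops below the threshold and the model abstains.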

Abstract

We present a formal problem formulation for \textit{Reliable} Audio-Visual Question Answering ($\mathcal{R}$-AVQA), where we prefer abstention over answering incorrectly. While recent AVQA models achieve high accuracy, their ability to identify when they are likely wrong, and to abstain accordingly, remains underexplored. To fill this gap, we explore several approaches and then propose Adaptive Confidence Refinement (ACR), a lightweight method to further enhance the performance of $\mathcal{R}$-AVQA. Our key insight is that the Maximum Softmax Probability (MSP) is Bayes-optimal only under strong calibration, a condition usually not met in deep neural networks, particularly in multimodal models. Instead of replacing MSP, our ACR maintains it as a primary confidence signal and applies input-adaptive residual corrections when MSP is deemed unreliable. ACR introduces two learned heads: i) a Residual Risk Head that predicts low-magnitude correctness residuals that MSP does not capture, and ii) a Confidence Gating Head to determine MSP trustworthiness. Our experiments and theoretical analysis show that ACR consistently outperforms existing methods in in-distribution, out-of-distribution, and data-bias settings across three different AVQA architectures, establishing a solid foundation for the $\mathcal{R}$-AVQA task. The code and checkpoints will be available upon acceptance at \href{https://github.com/PhuTran1005/R-AVQA}{https://github.com/PhuTran1005/R-AVQA}.
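To make the abstention setting concrete, the following sketch computes a risk-coverage curve in the usual selective-prediction sense: sort inputs by confidence, answer the most confident ones first, and track the error rate among answered inputs. The function names and toy data are hypothetical, and the paper's exact metric definitions are not reproduced here.

```python
import numpy as np

def risk_coverage(confidence, correct):
    """Coverage and risk when answering the k most confident inputs, k = 1..N."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-confidence, kind="stable")  # most confident first
    errors = 1.0 - correct[order]
    k = np.arange(1, len(confidence) + 1)
    coverage = k / len(confidence)      # fraction of inputs answered
    risk = np.cumsum(errors) / k        # error rate among answered inputs
    return coverage, risk

def coverage_at_risk(confidence, correct, max_risk):
    """Largest coverage whose risk stays at or below max_risk (0.0 if none)."""
    coverage, risk = risk_coverage(confidence, correct)
    ok = risk <= max_risk
    return float(coverage[ok].max()) if ok.any() else 0.0

# Toy data (hypothetical): 1 = correct answer, 0 = wrong answer.
conf = [0.9, 0.8, 0.7, 0.6, 0.5]
corr = [1, 1, 0, 1, 0]
cov_at_zero_risk = coverage_at_risk(conf, corr, max_risk=0.0)  # 0.4
```

A better confidence estimator ranks wrong answers lower, which raises the coverage achievable at a given risk level.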

Paper Structure (39 sections, 7 theorems, 41 equations, 8 figures, 20 tables, 2 algorithms)

This paper contains 39 sections, 7 theorems, 41 equations, 8 figures, 20 tables, 2 algorithms.

Introduction
Related Work
Reliable Audio-Visual Question Answering
Problem Definition and Notation
Evaluation Metrics
Baseline for Selection Functions
Adaptive Confidence Refinement
Experiments
Datasets and Baselines
Benchmarking Evaluation Metrics
Qualitative Analysis
Ablation Study
Discussion and Conclusion
Proofs
Proof of Theorem 3.3 (MSP Optimality under Strong Calibration)
...and 24 more sections

Key Result

Proposition 3.2

Let $g^*(\boldsymbol{x}) = \mathbb{P}[c(\boldsymbol{x}) = 1 \mid \boldsymbol{x}]$ be the Bayes-optimal selection function. For any confidence estimator $s(\boldsymbol{x})$, define the mean squared error as $\mathrm{MSE}(s) = \mathbb{E}\bigl[\bigl(s(\boldsymbol{x}) - c(\boldsymbol{x})\bigr)^2\bigr]$. If $\mathrm{MSE}(s_1) < \mathrm{MSE}(s_2)$ for two estimators $s_1$ and $s_2$, then $s_1$ is closer to $g^*$ in expected squared distance. Consequently, ranking by $s_1$ more closely approximates the Bayes-optimal ranking induced by $g^*$.
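The link between MSE and distance to $g^*$ follows from a standard decomposition, sketched here under the assumptions that $s$ is a deterministic function of $\boldsymbol{x}$ and that $c(\boldsymbol{x}) \in \{0, 1\}$ has conditional mean $g^*(\boldsymbol{x})$. This is a textbook argument and not necessarily the paper's own proof.

```latex
\begin{align*}
\mathrm{MSE}(s)
  &= \mathbb{E}\bigl[(s - g^* + g^* - c)^2\bigr] \\
  &= \mathbb{E}\bigl[(s - g^*)^2\bigr]
     + 2\,\mathbb{E}\bigl[(s - g^*)(g^* - c)\bigr]
     + \mathbb{E}\bigl[(g^* - c)^2\bigr] \\
  &= \mathbb{E}\bigl[(s - g^*)^2\bigr]
     + \mathbb{E}\bigl[g^*(1 - g^*)\bigr].
\end{align*}
```

The cross term vanishes because $\mathbb{E}[g^* - c \mid \boldsymbol{x}] = 0$, and the last term is the conditional variance of a Bernoulli variable, which does not depend on $s$. Comparing two estimators by MSE is therefore the same as comparing their expected squared distance to $g^*$.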

Figures (8)

Figure 1: $\mathcal{R}$-AVQA requires knowing when to answer. In an illustrative example from the MUSIC-AVQA dataset, standard confidence-based baselines (MSP, MCD, VS, and Doctor) produce incorrect answers with high confidence, while our ACR identifies the unreliable prediction by assigning it a lower confidence score.
Figure 2: Overview of the Adaptive Confidence Refinement framework for $\mathcal{R}$-AVQA. Stage 1 trains a standard AVQA model using cross-entropy loss. Stage 2 freezes the AVQA model and trains two additional heads (purple blocks) with binary correctness supervision to produce a more reliable, input-adaptive confidence.
Figure 3: Risk-coverage curves of various selection functions for QA-TIGER (top row) and TSPM (bottom row) up to 10% risk.
Figure 4: Qualitative examples of selective prediction with the QA-TIGER backbone. In these examples, standard confidence-based methods such as MSP, MCD, and VS produce overconfident but incorrect answers, whereas our proposed ACR correctly abstains by identifying unreliable predictions. Confidence scores at different target risk levels are shown in parentheses.
Figure 5: Distribution analysis of MSP and our ACR with TSPM models on the MUSIC-AVQA dataset.
...and 3 more figures

Theorems & Definitions (12)

Definition 3.1: Calibration
Proposition 3.2: MSE Reduction Implies Ranking Improvement
Remark 3.1
Theorem 3.3: MSP Optimality under Strong Calibration
Definition 3.4: Error Moments
Theorem 3.5: Fusion Benefit for Confidence Estimation
Theorem 3.6: Optimality of Input-Adaptive Confidence Fusion
Theorem 1.2: MSP Optimality under Strong Calibration
proof
Theorem 1.3: Fusion Benefit for Confidence Estimation
...and 2 more
