Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering
Dinh Phu Tran, Jihoon Jeong, Saad Wazir, Seongah Kim, Thao Do, Cem Subakan, Daeyoung Kim
TL;DR
This paper reframes audio-visual question answering as a reliable task ($\mathcal{R}$-AVQA) where abstention is preferred over incorrect answers. It introduces Adaptive Confidence Refinement (ACR), a lightweight, input-adaptive confidence framework that preserves the strong MSP baseline while adding a Residual Risk Head and a Confidence Gating Head to refine uncertainty using multimodal features; the final confidence is $C_{\mathrm{ACR}}(\boldsymbol{x}) = \alpha(\boldsymbol{x})C_{\mathrm{M}}(\boldsymbol{x}) + (1-\alpha(\boldsymbol{x}))C_{\mathrm{R}}(\boldsymbol{x})$. Through two-stage training, ACR learns to approximate the posterior probability of correctness $\mathbb{P}(c(\boldsymbol{x})=1|\boldsymbol{x})$, and theoretical results show that an optimal input-dependent fusion weight $\bar{\alpha}^{*}$ exists under a cross-moment condition, yielding lower MSE and better ranking than MSP alone. Empirically, ACR yields state-of-the-art risk-coverage performance across three AVQA backbones (QA-TIGER, ST-AVQA, TSPM) and across in-distribution, out-of-distribution, and data-bias scenarios on MUSIC-AVQA variants, with notably large gains in low-risk settings and improved calibration (ECE). The work establishes a solid foundation for reliable multimodal reasoning in AVQA and demonstrates how abstention can be combined with confidence refinement to produce trustworthy AI assistants in assistive and multimedia contexts.
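To make the fusion formula concrete, the following is a minimal Python sketch of the convex combination $C_{\mathrm{ACR}} = \alpha C_{\mathrm{M}} + (1-\alpha) C_{\mathrm{R}}$. It is an illustration under assumptions, not the authors' implementation: the function names `softmax`, `msp_confidence`, and `acr_confidence` are hypothetical, and the values of `c_res` and `alpha` are passed in as plain numbers, whereas in ACR they would come from the learned Residual Risk Head and Confidence Gating Head, whose architectures are not described in this summary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of answer logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_confidence(logits):
    """Maximum Softmax Probability: C_M(x) = max_k softmax(logits)_k."""
    return max(softmax(logits))


def acr_confidence(c_msp, c_res, alpha):
    """Convex fusion C_ACR = alpha * C_M + (1 - alpha) * C_R.

    Assumption: c_res and alpha are given as numbers here; in the paper
    they are outputs of learned heads that depend on the input x.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * c_msp + (1.0 - alpha) * c_res


# Hypothetical values: the gate trusts MSP only moderately for this input.
c_m = msp_confidence([2.0, 0.5, -1.0])
c_acr = acr_confidence(c_m, c_res=0.40, alpha=0.5)
```

With `alpha = 1` the fused score reduces to MSP, and with `alpha = 0` it reduces to the second confidence term, which matches the description of MSP being kept as the primary signal and corrected only where the gate deems it unreliable.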
Abstract
We present a formal problem formulation for \textit{Reliable} Audio-Visual Question Answering ($\mathcal{R}$-AVQA), in which abstaining is preferred over answering incorrectly. While recent AVQA models achieve high accuracy, their ability to recognize when they are likely wrong, and to abstain accordingly, remains underexplored. To fill this gap, we explore several approaches and then propose Adaptive Confidence Refinement (ACR), a lightweight method that further improves performance on $\mathcal{R}$-AVQA. Our key insight is that the Maximum Softmax Probability (MSP) is Bayes-optimal only under strong calibration, a condition usually not met in deep neural networks, particularly in multimodal models. Instead of replacing MSP, ACR maintains it as the primary confidence signal and applies input-adaptive residual corrections when MSP is deemed unreliable. ACR introduces two learned heads: i) a Residual Risk Head that predicts low-magnitude correctness residuals that MSP does not capture, and ii) a Confidence Gating Head that determines how trustworthy MSP is. Our experiments and theoretical analysis show that ACR consistently outperforms existing methods in in-distribution, out-of-distribution, and data-bias settings across three different AVQA architectures, establishing a solid foundation for the $\mathcal{R}$-AVQA task. The code and checkpoints will be available upon acceptance \href{https://github.com/PhuTran1005/R-AVQA}{here}.
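The abstention setting can be illustrated with a small, hedged Python sketch of threshold-based selective prediction: the model answers only when its confidence reaches a threshold, and we measure coverage (fraction of questions answered) and risk (error rate among answered questions). The function name `selective_risk_coverage`, the threshold rule, and the toy numbers are assumptions chosen for illustration; the paper's exact evaluation protocol is not given in this abstract.

```python
def selective_risk_coverage(confidences, correct, threshold):
    """Return (coverage, risk) when answering only if confidence >= threshold.

    confidences: list of floats, one confidence score per question
    correct:     list of 0/1 flags, 1 if the model's answer is right
    Convention (an assumption): risk is 0.0 when nothing is answered.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    # Keep the correctness flags of the questions the model chooses to answer.
    answered = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
    return coverage, risk


# Toy data (hypothetical): four questions with confidence and correctness.
confs = [0.9, 0.8, 0.4, 0.3]
flags = [1, 0, 1, 0]
cov, risk = selective_risk_coverage(confs, flags, threshold=0.5)
# Two of four questions are answered, and one of those two is wrong.
```

Sweeping the threshold traces out a risk-coverage curve; a better confidence estimator ranks incorrect answers lower, so it reaches the same risk at higher coverage.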
