Overcoming Language Priors for Visual Question Answering Based on Knowledge Distillation
Daowan Peng, Wei Wei
TL;DR
This work addresses the reliance of visual question answering (VQA) models on language priors, which harms generalization to out-of-distribution (OOD) data. It introduces KDAR, a knowledge-distillation-based framework that uses soft labels from a debiased teacher to regularize learning, together with an adaptive sample-wise reweighting scheme that balances head and tail samples. The method optimizes a joint objective $\mathcal{L}_{total} = \mathcal{L}_{apt} + \beta \mathcal{L}_{kd}$, where $\mathcal{L}_{apt} = \left(1-\exp\left(-\frac{\log(p^t_\tau)}{\log(p^s_\tau)}\right)\right) \mathcal{L}_{bce}$ and $\mathcal{L}_{kd}$ derives from the mixed-label formulation $y'=(1-\alpha)y+\alpha p^t_\tau$, which provides both regularization and semantically informed supervision. Empirically, KDAR achieves state-of-the-art results on VQA-CPv2 (e.g., 71.33% with LXMERT) and also improves in-distribution (IID) performance on VQAv2, indicating strong generalization and practical value for debiasing multimodal reasoning systems.
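The joint objective above can be illustrated with a minimal, single-sample sketch in plain Python. Several details are assumptions made only for illustration and are not stated in this summary: that $p^t_\tau$ and $p^s_\tau$ are temperature-softened softmax probabilities, that the adaptive weight is evaluated at the ground-truth answer index, and that the distillation term is a binary cross-entropy against the mixed label $y'$. All function names (`softmax_t`, `adaptive_weight`, `kdar_loss`) are hypothetical.

```python
import math

def softmax_t(logits, tau=1.0):
    """Temperature-scaled softmax (assumed form of p_tau)."""
    scaled = [l / tau for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def bce(targets, probs, eps=1e-12):
    """Mean binary cross-entropy over answer candidates."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(targets)

def adaptive_weight(p_teacher, p_student, eps=1e-12):
    """Sample weight 1 - exp(-log(p_t) / log(p_s)) from the formula above."""
    p_t = min(max(p_teacher, eps), 1.0 - eps)
    p_s = min(max(p_student, eps), 1.0 - eps)
    return 1.0 - math.exp(-math.log(p_t) / math.log(p_s))

def mixed_label(y, p_teacher, alpha):
    """Mixed label y' = (1 - alpha) * y + alpha * p_t."""
    return [(1.0 - alpha) * yi + alpha * pi for yi, pi in zip(y, p_teacher)]

def kdar_loss(student_logits, teacher_logits, y, gt_index,
              alpha=0.5, beta=1.0, tau=1.0):
    """L_total = L_apt + beta * L_kd for one sample (illustrative only)."""
    p_s = softmax_t(student_logits, tau)
    p_t = softmax_t(teacher_logits, tau)
    # Assumption: the weight uses the probability of the ground-truth answer.
    w = adaptive_weight(p_t[gt_index], p_s[gt_index])
    l_apt = w * bce(y, p_s)
    # Assumption: L_kd is BCE between student probabilities and y'.
    l_kd = bce(mixed_label(y, p_t, alpha), p_s)
    return l_apt + beta * l_kd

if __name__ == "__main__":
    y = [0.0, 1.0, 0.0]  # ground-truth answer scores
    loss = kdar_loss([2.0, 0.5, 0.1], [0.2, 1.5, 0.1], y, gt_index=1)
    print(f"total loss: {loss:.4f}")
```

Because both log-probabilities are negative, their ratio is positive and the weight always lies strictly between 0 and 1. Setting `alpha=0` reduces the mixed label to the original label `y`, which makes it easy to check the distillation term in isolation.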
Abstract
Previous studies have shown that visual question answering (VQA) models are prone to relying on language priors when predicting answers. As a result, predictions often depend on linguistic shortcuts rather than a comprehensive grasp of multimodal knowledge, which diminishes generalization ability. In this paper, we propose a novel method, KDAR, which leverages knowledge distillation to address the prior-dependency problem in the VQA task. Specifically, soft labels from a well-trained teacher act as a regularizer that penalizes overfitting to the most common answers. These soft labels also provide semantic guidance that narrows the range of candidate answers. Additionally, we design an adaptive sample-wise reweighting learning strategy that further mitigates bias by dynamically adjusting the importance of each sample. Experimental results demonstrate that our method improves performance in both out-of-distribution (OOD) and in-distribution (IID) settings. In particular, it achieves state-of-the-art performance on the VQA-CPv2 OOD benchmark, significantly outperforming previous approaches.
