Overcoming Language Priors for Visual Question Answering Based on Knowledge Distillation

Daowan Peng; Wei Wei

TL;DR

This work addresses the problem that visual question answering (VQA) models rely on language priors, harming generalization to out-of-distribution data. It introduces KDAR, a knowledge-distillation-based framework that uses soft labels from a debiased teacher to regularize learning and an adaptive sample-wise reweighting scheme to balance head and tail samples. The method optimizes a joint objective $L_{total} = L_{apt} + \beta L_{kd}$, where $L_{apt} = \left(1-\exp\left(-\frac{\log(p^t_\tau)}{\log(p^s_\tau)}\right)\right) \mathcal{L}_{bce}$ and $L_{kd}$ derives from a mixed-label formulation $y'=(1-\alpha)y+\alpha p^t_\tau$, promoting regularization and semantically informed supervision. Empirically, KDAR achieves state-of-the-art results on VQA-CPv2 (e.g., 71.33% with LXMERT) and improves IID performance on VQAv2, demonstrating strong generalization and practical impact for debiasing multimodal reasoning systems.
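The pieces of this objective can be illustrated with a short sketch. It is not the authors' implementation. It assumes that $p^t_\tau$ and $p^s_\tau$ in the weight are the scalar probabilities that the teacher and the student assign to the ground-truth answer, that the cross-entropy is taken against the hard label $y$, and that $L_{kd}$ is supplied as an already-computed number, since its exact form is not given here. All function names are hypothetical.

```python
import math

def mixed_label(y, p_teacher, alpha):
    """Mixed label y' = (1 - alpha) * y + alpha * p_teacher, element-wise."""
    return [(1.0 - alpha) * yi + alpha * pi for yi, pi in zip(y, p_teacher)]

def adaptive_weight(p_teacher, p_student):
    """Sample weight 1 - exp(-log(p_t) / log(p_s)).

    Assumption: both arguments are scalars strictly between 0 and 1.
    """
    return 1.0 - math.exp(-math.log(p_teacher) / math.log(p_student))

def bce(y, p, eps=1e-12):
    """Mean binary cross-entropy between targets y and probabilities p."""
    terms = []
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        terms.append(-(yi * math.log(pi) + (1.0 - yi) * math.log(1.0 - pi)))
    return sum(terms) / len(terms)

def total_loss(l_apt, l_kd, beta):
    """Joint objective L_total = L_apt + beta * L_kd."""
    return l_apt + beta * l_kd

# Toy example with three candidate answers; index 1 is the ground truth.
y = [0.0, 1.0, 0.0]
p_teacher = [0.2, 0.5, 0.3]
p_student = [0.3, 0.4, 0.3]

y_mixed = mixed_label(y, p_teacher, alpha=0.5)
weight = adaptive_weight(p_teacher[1], p_student[1])
l_apt = weight * bce(y, p_student)
```

In this toy setup the mixed label moves half of the probability mass from the one-hot target toward the teacher's distribution, and the weight rescales the cross-entropy term of a single sample.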

Abstract

Previous studies have pointed out that visual question answering (VQA) models are prone to relying on language priors for answer predictions. In this context, predictions often depend on linguistic shortcuts rather than a comprehensive grasp of multimodal knowledge, which diminishes their generalization ability. In this paper, we propose a novel method, KDAR, which leverages knowledge distillation to address the prior-dependency problem in the VQA task. Specifically, the regularization effect of soft labels from a well-trained teacher is used to penalize overfitting to the most common answers. Beyond their regularization role, the soft labels also provide semantic guidance that narrows the range of candidate answers. Additionally, we design an adaptive sample-wise reweighting learning strategy to further mitigate bias by dynamically adjusting the importance of each sample. Experimental results demonstrate that our method enhances performance in both out-of-distribution (OOD) and independent and identically distributed (IID) settings. Our method achieves state-of-the-art performance on the VQA-CPv2 OOD benchmark, significantly outperforming previous state-of-the-art approaches.

Paper Structure

This paper contains 19 sections, 8 equations, 3 figures, and 4 tables.

Introduction
Related Works
Augmentation-based methods
Non-augmentation-based methods
Methodology
Preliminaries
The Paradigm of VQA
The Paradigm of Knowledge Distillation
KDAR Method
Knowledge Distillation Learning Strategy
Adaptive Reweighting Learning Strategy
Learning Objective
Experiments
Experimental Setting
Experimental Results
...and 4 more sections

Figures (3)

Figure 1: Previous VQA models tend to predict the most frequent answers in the training dataset based on prior knowledge of question-answer pairs, rather than on a fine-grained understanding of both images and questions.
Figure 2: The training pipeline of our method. The image-question pairs are fed to a well-trained teacher VQA (T-VQA) model and the student VQA (S-VQA) model. The softmax function is then applied to the logits $Z^t$ of the teacher model and $Z^s$ of the student model to compute the probabilities at a temperature $\tau$. The student model is then trained to minimize the total loss. Note that the parameters of the T-VQA model are frozen during the training stage.
Figure 3: Effect of the hyperparameters scalar weight $\beta$ and temperature $\tau$. The $x$-axis measures the temperature and the $y$-axis quantifies the accuracy.
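
The temperature-scaled softmax mentioned in the Figure 2 caption, and the role of $\tau$ studied in Figure 3, can be sketched as follows. This is a generic illustration of a softmax at temperature $\tau$, written under the assumption that the logits are divided by $\tau$ before normalization (the usual knowledge-distillation convention). The function name and the toy logits are hypothetical and not taken from the paper.

```python
import math

def softmax_with_temperature(logits, tau):
    """Softmax over logits divided by tau, with max-subtraction for stability."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy teacher logits for three candidate answers (illustrative values only).
teacher_logits = [2.0, 1.0, 0.0]

p_sharp = softmax_with_temperature(teacher_logits, tau=1.0)
p_soft = softmax_with_temperature(teacher_logits, tau=4.0)
```

A larger $\tau$ flattens the distribution, so the soft labels spread more probability mass over the non-top answers; a smaller $\tau$ concentrates it on the top answer.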
