Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels

Tianming Liang; Chaolei Tan; Beihao Xia; Wei-Shi Zheng; Jian-Fang Hu

TL;DR

This work introduces a simple yet effective ranking distillation framework (RADI) that employs a teacher model trained with incomplete labels to generate rankings for potential answers, which contain rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information.

Abstract

This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task, since a question may have multiple answers. However, due to annotation costs, the labels in existing benchmarks are always extremely insufficient, typically one answer per question. As a result, existing works tend to directly treat all the unlabeled answers as negative labels, leading to limited ability for generalization. In this work, we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings for potential answers, which contain rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher model, we further present two robust and parameter-free ranking distillation approaches: a pairwise approach which introduces adaptive soft margins to dynamically refine the optimization constraints on various pairwise rankings, and a listwise approach which adopts sampling-based partial listwise learning to resist the bias in teacher ranking. Extensive experiments on five popular benchmarks consistently show that both our pairwise and listwise RADIs outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem.
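The two distillation ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: it assumes the adaptive soft margin of a pair is the gap between the teacher's scores for the two answers, and that partial listwise learning means sampling a subset of candidate answers uniformly and applying a Plackett-Luce (ListMLE-style) likelihood to the teacher's ordering of that subset. The function names and the parameter `k` are hypothetical.

```python
import math
import random


def pairwise_soft_margin_loss(student, teacher):
    """Mean hinge loss over all pairs (i, j) that the teacher ranks i above j.

    Assumption: the margin for a pair is the teacher's score gap, so pairs
    the teacher is unsure about impose only a weak constraint on the student.
    """
    total, count = 0.0, 0
    n = len(student)
    for i in range(n):
        for j in range(n):
            margin = teacher[i] - teacher[j]
            if margin > 0:  # teacher prefers answer i over answer j
                total += max(0.0, margin - (student[i] - student[j]))
                count += 1
    return total / count if count else 0.0


def partial_listwise_loss(student, teacher, k, rng):
    """Plackett-Luce negative log-likelihood on a sampled sub-list.

    Assumption: k candidates are drawn uniformly without replacement, then
    ordered by teacher score; the student is trained to reproduce that order.
    """
    idx = rng.sample(range(len(student)), k)
    idx.sort(key=lambda a: teacher[a], reverse=True)
    loss = 0.0
    for pos in range(k):
        rest = [student[a] for a in idx[pos:]]
        top = max(rest)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(s - top) for s in rest))
        loss += log_z - student[idx[pos]]
    return loss


# Toy example: three candidate answers scored by teacher and student.
teacher_scores = [0.6, 0.3, 0.1]
aligned_student = [2.0, 1.0, 0.0]   # same order as the teacher
reversed_student = [0.0, 1.0, 2.0]  # opposite order

pair_good = pairwise_soft_margin_loss(aligned_student, teacher_scores)
pair_bad = pairwise_soft_margin_loss(reversed_student, teacher_scores)
list_good = partial_listwise_loss(aligned_student, teacher_scores, 3, random.Random(0))
list_bad = partial_listwise_loss(reversed_student, teacher_scores, 3, random.Random(0))
```

In this toy setting, a student that agrees with the teacher's ordering receives a lower loss than one that reverses it under both objectives. In a training loop, either ranking loss would be added to the usual classification loss on the labeled answer.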

Paper Structure (13 sections, 9 equations, 7 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Ranking Distillation for OE-VQA
Overall Framework of RADI
Adaptive Pairwise Ranking Distillation
Partial Listwise Ranking Distillation
Experiments
Experimental Setup
Comparison with State-of-the-arts
Evaluation on Insufficient Labeling Problem
Ablation Study
Qualitative Results
Conclusion

Figures (7)

Figure 1: Two examples of the insufficient labeling problem in the MSRVTT-QA dataset. The correct answers to each question are colored in red. Existing OE-VQA methods tend to directly regard the entire unlabeled set as negative answers.
Figure 2: Comparison between different potential schemes for the insufficient labeling problem, where Model T serves as a teacher model to enrich the label information.
Figure 3: An overview of RADI, which is an LTR-based training framework for OE-VQA. Within RADI, the video-QA model is optimized using two loss functions: (i) a classification loss that maximizes the prediction probability of the labeled answers (i.e., noodles and eggs) and suppresses the rest, including the potential positive answers (e.g., soup and food); (ii) a ranking loss that may retrieve these potential positive answers by pushing the predicted ranking to align with the ranking list provided by a well-trained teacher model.
Figure 4: Improvements of using our sampling strategies on various listwise loss functions.
Figure 5: Impacts of using $L_{cls}$, $L_{rank}$ and different $\alpha$ on RADI-P and RADI-L.
...and 2 more figures
