Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering
Haopeng Li, Mohammed Bennamoun, Jun Liu, Hossein Rahmani, Qiuhong Ke
TL;DR
This work aims to improve VideoQA generalization with uncertainty-aware curriculum learning (UCL), which trains progressively on data of increasing difficulty, where difficulty is measured by uncertainty rather than by training loss alone. It casts VideoQA as a stochastic computation graph, which enables probabilistic modeling of visual representations and yields two uncertainty types that guide training: feature uncertainty $U_F$ and predictive uncertainty $U_P$. The approach integrates with MASN and uses sampling-based variational inference to obtain uncertainty-aware predictions, achieving state-of-the-art results on TGIF-QA and NExT-QA along with meaningful uncertainty quantification and robustness analyses. Improved generalization is reported across multiple VideoQA models and datasets, and ablations and hyper-parameter studies support the benefits of probabilistic modeling and uncertainty-guided curriculum scheduling.
Abstract
While significant advances have been made in video question answering (VideoQA), the potential benefits of improving model generalization through tailored difficulty scheduling have been largely overlooked. This paper seeks to bridge that gap by incorporating VideoQA into a curriculum learning (CL) framework that progressively trains models from simpler to more complex data. Conventional self-paced CL methods rely on training loss to measure difficulty, which might not accurately reflect the intricacies of video-question pairs, so we introduce uncertainty-aware CL, in which uncertainty serves as the guiding principle for dynamically adjusting the difficulty. To quantify uncertainty, we present a probabilistic modeling approach for VideoQA. Specifically, we formulate VideoQA as a stochastic computation graph in which the hidden representations are treated as stochastic variables. This yields two distinct types of uncertainty: one related to the inherent uncertainty in the data and another pertaining to the model's confidence. In practice, we integrate the VideoQA model into our framework and conduct comprehensive experiments. The results show that our approach not only improves performance but also effectively quantifies uncertainty in the context of VideoQA.
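To make the idea concrete, the following is a minimal Python sketch of how sampled hidden representations could give two uncertainty scores that then drive an easy-to-hard schedule. It is an illustration under stated assumptions, not the paper's formulation: the diagonal-Gaussian feature distribution, the linear answer head, the two uncertainty proxies (mean feature variance for $U_F$, entropy of the averaged prediction for $U_P$), and the hard keep-the-least-uncertain-fraction schedule are all assumptions, and every function name is hypothetical.

```python
import math
import random


def sample_features(mean, log_var, rng):
    """Draw one feature vector from a diagonal Gaussian (assumed form)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_predict(mean, log_var, weight_rows, num_samples, rng):
    """Run a hypothetical linear answer head on several sampled features."""
    samples = []
    for _ in range(num_samples):
        z = sample_features(mean, log_var, rng)
        logits = [sum(w * f for w, f in zip(row, z)) for row in weight_rows]
        samples.append(softmax(logits))
    return samples


def feature_uncertainty(log_var):
    """Assumed proxy for U_F: mean variance of the feature distribution."""
    return sum(math.exp(lv) for lv in log_var) / len(log_var)


def predictive_uncertainty(prob_samples):
    """Assumed proxy for U_P: entropy of the sample-averaged prediction."""
    n = len(prob_samples)
    num_classes = len(prob_samples[0])
    mean_probs = [sum(p[k] for p in prob_samples) / n
                  for k in range(num_classes)]
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)


def curriculum_weights(uncertainties, progress):
    """Give weight 1 to the `progress` fraction of least-uncertain examples.

    `progress` is in (0, 1] and is assumed to grow as training advances,
    so harder (more uncertain) examples are admitted later.
    """
    n = len(uncertainties)
    keep = max(1, math.ceil(progress * n))
    order = sorted(range(n), key=lambda i: uncertainties[i])
    weights = [0.0] * n
    for i in order[:keep]:
        weights[i] = 1.0
    return weights
```

In a training loop, one would call `mc_predict` for each video-question pair, score it with the two uncertainty functions, and multiply each example's loss by the weight returned from `curriculum_weights`, raising `progress` toward 1 over the epochs. A soft weighting could replace the hard 0/1 threshold; the hard version is used here only because it is the simplest to read.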
