Uncertainty-Guided Self-Questioning and Answering for Video-Language Alignment

Jin Chen, Kaijing Ma, Haojian Huang, Han Fang, Hao Sun, Mehdi Hosseinzadeh, Zhe Liu

TL;DR

BoViLA tackles expensive video-text annotation in VideoQA by training a single model to self-generate questions and answers, thereby expanding training data through LLM-based bootstrapping. It introduces an uncertainty-aware filter based on Evidential Deep Learning to prune low-quality self-generated questions, improving modality alignment while keeping training end-to-end. The framework achieves strong results on five VideoQA benchmarks with a small number of trainable parameters and provides extensive ablations and analyses of its components. This approach offers a data-efficient path to leveraging rich video content and LLM priors for robust video-language alignment in multimodal systems.

Abstract

The development of multi-modal models has been rapidly advancing, with some demonstrating remarkable capabilities. However, annotating video-text pairs remains expensive and insufficient. Take video question answering (VideoQA) as an example: human-annotated questions and answers often cover only part of the video, since the corresponding text is often short and monotonous, leading to underutilization of the video. To address this, we propose a Bootstrapping Video-Language Alignment framework (BoViLA), a self-training method that augments question samples during training through LLM-based self-questioning and answering, which helps the model exploit video information and the internal knowledge of LLMs more thoroughly to improve modality alignment. However, low-quality self-generated questions may instead degrade performance, especially in the early stages of training, as we have observed in our experiments. To filter out bad self-generated questions, we introduce Evidential Deep Learning (EDL) to estimate uncertainty and assess the quality of self-generated questions by evaluating the modality alignment within the context. To the best of our knowledge, this work is the first to explore LLM-based self-training frameworks for modality alignment. We evaluate BoViLA on five strong VideoQA benchmarks, where it outperforms several state-of-the-art methods and demonstrates its effectiveness and generality. Additionally, we provide extensive analyses of the self-training framework and the EDL-based uncertainty filtering mechanism. The code will be made available.
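
To make the idea of an uncertainty-based filter concrete, the following is a minimal sketch, not the paper's implementation. It assumes the standard Dirichlet-based EDL formulation, in which non-negative evidence e_k gives concentration parameters alpha_k = e_k + 1 and the uncertainty is u = K / sum(alpha_k) for K classes. The softplus evidence function, the helper names, and the threshold value of 0.5 are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    # Numerically stable softplus, log(1 + exp(x)); used here as an
    # assumed evidence function (always strictly positive).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def edl_uncertainty(logits: Sequence[float]) -> float:
    """Uncertainty u = K / S under the standard Dirichlet-based EDL setup.

    S is the Dirichlet strength, the sum of alpha_k = evidence_k + 1.
    u lies in (0, 1]; little total evidence pushes u toward 1.
    """
    evidence = [softplus(z) for z in logits]
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    return len(logits) / strength


def keep_question(logits: Sequence[float], threshold: float = 0.5) -> bool:
    # Hypothetical filter rule: keep a self-generated question only if
    # the answerer's uncertainty on it does not exceed the threshold.
    return edl_uncertainty(logits) <= threshold


if __name__ == "__main__":
    confident = [10.0, -10.0, -10.0, -10.0]  # strong evidence for one option
    flat = [0.0, 0.0, 0.0, 0.0]              # little evidence for any option
    print(edl_uncertainty(confident), keep_question(confident))
    print(edl_uncertainty(flat), keep_question(flat))
```

In this sketch, a question on which the answerer gathers strong evidence for one candidate answer yields a lower uncertainty than one on which the evidence is uniformly weak, so only the former passes the filter.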

Paper Structure (26 sections, 19 equations, 7 figures, 4 tables)

This paper contains 26 sections, 19 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Framework overview. The model plays the roles of both questioner and answerer. As a questioner, the model generates new questions based on the video, answer and seed question. As an answerer, the model endeavors to predict the answer from its own generated questions based on the video. Low-quality self-generated questions are filtered by an EDL-based filter to ensure that the knowledge received by the answerer is correct.
  • Figure 2: Model overview. Our model acts as both questioner and answerer. During the forward pass, the questioner generates new questions from the seed question, which are then used as input for the answerer. Green elements and dashed arrows are associated with the questioner, while blue elements and solid arrows pertain to the answerer. In the backward pass, the answerer backpropagates gradients from the self-generated questions to the questioner, as shown by the red arrows. The self-generated questions are constrained by a regularization term and the EDL-based filter. Steps 1-11 illustrate the BoViLA workflow, detailing the question-answer bootstrapping process.
  • Figure 3: Comparison on five challenging VideoQA benchmarks with both LLM-based and non-LLM-based baselines. STAR contains four question types: Int. (interaction), Seq. (sequence), Pre. (prediction), and Fea. (feasibility). * denotes that we do not use the speech captions. Total accuracy is highlighted in green. The best result in each column is shown in bold and the second-best is underlined.
  • Figure 4: Examples of vanilla self-generated questions (degenerate questions), self-generated questions with the regularization term, and the corresponding EDL-estimated uncertainty.
  • Figure 5: Correlation between EDL-estimated uncertainty and the quality of self-generated questions. We use $\mathcal{L}_{\mathrm{v\overline{q}a}}$ and $\mathcal{L}_{\mathrm{reg}}$ as approximate measures of the quality of self-generated questions. For a clearer correlation analysis, we apply Min-Max normalization separately to the uncertainty, $\mathcal{L}_{\mathrm{v\overline{q}a}}$, and $\mathcal{L}_{\mathrm{reg}}$, scaling each to the range $[0, 1]$.
  • ...and 2 more figures
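
The Min-Max normalization mentioned in the Figure 5 caption can be sketched as follows. This is an illustrative example only: the function names are hypothetical, the numbers are toy values rather than results from the paper, and the use of the Pearson coefficient as the correlation measure is an assumption, since the caption does not name one.

```python
import math
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    # Rescale values linearly so the minimum maps to 0 and the maximum to 1.
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values equal, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    # Pearson correlation coefficient (assumed measure, see lead-in).
    # Requires both sequences to have non-zero variance.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Toy per-question values, invented purely for illustration.
    uncertainty = [0.21, 0.35, 0.50, 0.80]
    loss = [1.2, 1.9, 2.4, 4.0]
    print(min_max(uncertainty))
    print(pearson(min_max(uncertainty), min_max(loss)))
```

Because Min-Max normalization is a positive affine rescaling, it leaves the Pearson coefficient unchanged; its role is to place quantities with different scales on a common 0-1 axis so they can be compared in one plot.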