ViLA: Efficient Video-Language Alignment for Video Question Answering

Xijun Wang, Junbang Liang, Chun-Kai Wang, Kenan Deng, Yu Lou, Ming Lin, Shan Yang

TL;DR

The ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way, and designs a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module.

Abstract

In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models have shown promising results on problems such as visual question answering (VQA). However, how to efficiently and effectively sample video frames when adapting a pre-trained large image-language model to video-language alignment is still the major challenge. Compared with prior work, our ViLA model demonstrates the capability of selecting key frames with critical contents, thus improving the video-language alignment accuracy while reducing the inference latency (+3.3% on NExT-QA Temporal with 3.0X speed up). Overall, our ViLA network outperforms the state-of-the-art methods on the video question-answering benchmarks: +4.6% on STAR Interaction and +2.2% on STAR average with 3.0X speed up, and our 2-frame model outperforms the 4-frame SeViLA on the VLEP dataset with 4.2X speed-up. The code will be available at https://github.com/xijun-cs/ViLA.

Paper Structure (24 sections, 10 equations, 5 figures, 5 tables)

This paper contains 24 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Our efficient Video-Language Alignment (ViLA) model, built on frame prompting and distillation, contains two new modules: a text-guided Frame-Prompter and a cross-modal QFormer-Distiller. It learns to extract the most question-related frames while keeping the inference latency low.
  • Figure 2: Model Overview. Our ViLA model includes 4 sub-modules: the visual encoder, the text-supervised Frame-Prompter (FP), the QFormer-Distiller (QFD), and an LLM. We encode the video frames through a frozen visual encoder. Then we train the Teacher-QFormer using all the frame features. After that, we train the Student-QFormer and Frame-Prompter end-to-end. Unlike the Teacher-QFormer, our Student-QFormer is trained with masked frame features from the text-supervised Frame-Prompter. Finally, the input question text and the QFormer-transformed visual features go through a frozen large language model to generate the answer. Our network supports both leveraging a frozen LLM through proper visual prompting, which leaves the original LLM's ability on language tasks unaffected, and simultaneously finetuning the LLM with LoRA to get optimal performance on specific tasks.
  • Figure 3: Text-guided Frame-Prompter. Here we show the details of our learnable text-guided Frame-Prompter. We design a learnable Frame-Prompter to sample the frames most related to the text query, with two design choices (a and b). We choose design (a) for diversified temporal sampling. We first encode the mean-pooled segment features. We then apply the Gumbel Softmax to compute the segment mask, which keeps the selection differentiable. The selected frame embeddings then go through the QFormer-Distiller. Here $B$ means batch size, $T$ means frame number, and $N\times C$ means the frame feature sequences. The Frame-Prompter is learned with the text-supervised gradient. When the VQA loss is applied, the gradient related to the input question text further flows to the Frame-Prompter and guides it to select the most critical frames.
  • Figure 4: Key-frame Selection Comparison Results (select 4 frames from 32 frames). We compare the frames selected by our ViLA with those selected by the SOTA SeViLA [yu2023self] method. Across different types of questions, especially the Causal and Temporal question types, the keyframes selected by our network are more relevant to the question.
  • Figure 5: QFormer-Distiller Results Visualization. Here we visualize the keyframes selected after cross-modal distillation. After distillation, we can select the most question-relevant frames even from 16 frames.
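
The segment-wise Gumbel Softmax selection described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the linear scorer `weights`, the function names `mean_pool`, `gumbel_softmax` and `frame_prompter_mask`, and the choice of exactly one frame per equal-length segment are all assumptions made for illustration. The sketch uses only the Python standard library and shows the forward pass only; in a trainable version, the soft probabilities would typically carry the gradient while the hard one-hot mask is used for selection (the straight-through trick).

```python
import math
import random


def mean_pool(tokens):
    """Average N token vectors of size C into a single C-dim vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one Gumbel-Softmax sample; return (soft_probs, hard_one_hot)."""
    noisy = []
    for logit in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    best = soft.index(max(soft))
    hard = [1.0 if i == best else 0.0 for i in range(len(soft))]
    return soft, hard


def frame_prompter_mask(frames, weights, num_segments, tau=1.0, rng=random):
    """Return a 0/1 mask of length T with one selected frame per segment.

    frames:  T frames, each a list of N token vectors of size C.
    weights: hypothetical C-dim linear scorer standing in for the learned,
             text-supervised encoder (an assumption of this sketch).
    """
    num_frames = len(frames)
    if num_frames % num_segments != 0:
        raise ValueError("T must be divisible by num_segments")
    seg_len = num_frames // num_segments
    # One logit per frame, computed from its mean-pooled feature.
    logits = [
        sum(w * x for w, x in zip(weights, mean_pool(frame)))
        for frame in frames
    ]
    mask = []
    for s in range(num_segments):
        segment_logits = logits[s * seg_len:(s + 1) * seg_len]
        _, hard = gumbel_softmax(segment_logits, tau, rng)
        mask.extend(hard)
    return mask


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy input: T=8 frames, N=2 tokens per frame, C=3 channels.
    frames = [[[rng.random() for _ in range(3)] for _ in range(2)]
              for _ in range(8)]
    mask = frame_prompter_mask(frames, [0.5, -0.2, 0.1],
                               num_segments=4, rng=rng)
    print(mask)  # 8 entries, exactly one 1.0 in each pair of frames
```

Selecting within segments rather than over all T frames at once is one simple way to obtain temporally spread-out picks, which is in the spirit of the "diversified temporal sampling" the caption mentions. Lowering `tau` makes the soft probabilities closer to one-hot.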