VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding
Yizhou Wang, Ruiyi Zhang, Haoliang Wang, Uttaran Bhattacharya, Yun Fu, Gang Wu
TL;DR
VaQuitA tackles the challenge of aligning video content with text in LLM-based video understanding by moving beyond uniform frame sampling and simple projection layers. It introduces a CLIP-score-guided data alignment scheme, a trainable Video Perceiver for compact video representations, and a Visual-Query Transformer (VQ-Former) that fuses video features with the textual query before they are fed into an LLM; a simple prompt, "Please be critical", further enhances reasoning. End-to-end training updates only the newly introduced components while the CLIP and LLM backbones stay frozen, leading to state-of-the-art zero-shot results on MSVD-QA, MSRVTT-QA, and ActivityNet-QA, and enabling high-quality multi-turn video dialogues. The approach demonstrates the practical impact of targeted data alignment, feature alignment, and prompt engineering for robust, interactive video understanding in real-world settings.
Abstract
Recent advancements in language-model-based video understanding have been progressing at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, the focus of prior research has been predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In our study, we introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features. We also discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
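To make the data-level idea concrete, the following is a minimal sketch of question-guided frame selection. It is not the authors' implementation. It assumes the frame and question embeddings have already been computed by a CLIP-style encoder and are non-zero, it uses cosine similarity as the CLIP score, and it keeps the top-k frames. The function name `select_frames_by_clip_score` and the parameter `k` are hypothetical, as is the choice to return the selected frames in temporal order.

```python
import numpy as np


def select_frames_by_clip_score(frame_embs, text_emb, k):
    """Pick the k frames whose embeddings best match the question embedding.

    frame_embs: array-like of shape (num_frames, dim), assumed precomputed
                by a CLIP-style image encoder (hypothetical input).
    text_emb:   array-like of shape (dim,), assumed precomputed by the
                matching text encoder (hypothetical input).
    Returns (indices, scores): the chosen frame indices in temporal order,
    and the cosine-similarity score of every frame.
    """
    frame_embs = np.asarray(frame_embs, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)

    # L2-normalise so that the dot product equals cosine similarity.
    frame_unit = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    text_unit = text_emb / np.linalg.norm(text_emb)
    scores = frame_unit @ text_unit

    # Rank frames from highest to lowest score and keep the best k.
    k = min(k, scores.shape[0])
    top = np.argsort(-scores, kind="stable")[:k]

    # Restore temporal order (an assumption of this sketch).
    return np.sort(top), scores


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real CLIP features.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    question = [1.0, 0.0]
    indices, scores = select_frames_by_clip_score(frames, question, k=2)
    print(indices, scores)
```

Compared with uniform sampling, which would pick frames at fixed intervals regardless of the question, this ranking concentrates the frame budget on the frames most relevant to the query.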
