VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding

Yizhou Wang, Ruiyi Zhang, Haoliang Wang, Uttaran Bhattacharya, Yun Fu, Gang Wu

TL;DR

VaQuitA tackles the challenge of aligning video content with text in LLM-based video understanding by moving beyond uniform frame sampling and simple projection. It introduces a CLIP-score guided Data Alignment scheme, a trainable Video Perceiver for compact video representations, and a Visual-Query Transformer (VQ-Former) that fuses video features with the textual query before they are fed into an LLM; a simple prompt, "Please be critical", further enhances reasoning. End-to-end training tunes only the trainable components while freezing the CLIP and LLM backbones, leading to state-of-the-art zero-shot results on MSVD-QA, MSRVTT-QA, and ActivityNet-QA, and enabling high-quality multi-turn video dialogues. The approach demonstrates the practical impact of targeted data alignment, feature alignment, and prompt engineering for robust, interactive video understanding in real-world settings.

Abstract

Recent advancements in language-model-based video understanding have been progressing at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, the focus of prior research has been predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In our study, we introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features. We also discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
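
The data-level idea in the abstract, choosing frames by how well they match the question rather than purely at even intervals, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that frame and question embeddings have already been produced by a CLIP-like encoder, uses plain cosine similarity as a stand-in for the CLIP score, and mixes a few uniformly spaced frames with the top-ranked ones. The function names `cosine`, `uniform_indices`, and `select_frames`, and the parameters `n_uniform` and `n_similar`, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def uniform_indices(num_frames, k):
    """Pick up to k evenly spaced frame indices (the centre of each segment)."""
    k = min(k, num_frames)
    if k <= 0:
        return []
    step = num_frames / k
    return [int(step * i + step / 2) for i in range(k)]


def select_frames(frame_embs, question_emb, n_uniform, n_similar):
    """Combine uniform sampling with similarity-ranked sampling.

    Assumption: frame_embs and question_emb live in the same embedding
    space (for example, outputs of a CLIP image and text encoder).
    Returns the selected frame indices in temporal order.
    """
    num_frames = len(frame_embs)
    chosen = set(uniform_indices(num_frames, n_uniform))

    # Rank all frames by similarity to the question, highest first.
    scores = [cosine(f, question_emb) for f in frame_embs]
    ranked = sorted(range(num_frames), key=lambda i: scores[i], reverse=True)

    # Add the best-matching frames that were not already picked uniformly.
    added = 0
    for i in ranked:
        if added == n_similar:
            break
        if i not in chosen:
            chosen.add(i)
            added += 1
    return sorted(chosen)
```

Returning the indices in temporal order keeps the selected frames usable by a downstream spatio-temporal encoder; how the two sampling strategies are actually balanced in VaQuitA is not specified in this summary.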

Paper Structure

This paper contains 28 sections, 5 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Accuracy ($\uparrow$) and relative score ($\uparrow$) comparison on the MSVD-QA dataset of current state-of-the-art LLM-based zero-shot video understanding models (evaluated using the GPT-3.5-turbo API). VaQuitA achieves the best performance on both evaluation metrics. See Sec. \ref{sec: exp-zsvqa} for more details.
  • Figure 2: Framework overview. In response to a specific question, our framework begins by processing the input video with a sampling module that identifies key frames based on their relevance to the question's context. These frames are then processed by a pre-trained visual encoder to obtain spatio-temporal features. These features are subsequently refined into condensed embeddings by our newly developed Video Perceiver. In parallel, the question undergoes tokenization. Both the video and text embeddings are then synergized using our Visual-Query Transformer, which aligns the multimodal information more effectively. The resulting text-influenced video features are concatenated with the text embeddings and fed into the Large Language Model to generate the answer. During the testing phase, we propose to add an additional prompt, "Please be critical", before the question for performance enhancement. The whole proposed VaQuitA framework supports end-to-end training. Best viewed in color.
  • Figure 3: Data alignment. Our proposed sampling module consists of both uniform sampling and similarity-based sampling for the training process. Best viewed in color.
  • Figure 4: Feature alignment. The extracted spatio-temporal features of the video clip first go through Video Perceiver for representative embedding extraction, and are afterwards sent to Visual-Query Transformer for interleaving with text embeddings.
  • Figure 5: Multi-round conversations of VaQuitA and Video-ChatGPT \cite{maaz2023video} on two video samples from the ActivityNet-200 \cite{caba2015activitynet} dataset.
  • ...and 2 more figures
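
The Figure 2 caption notes that the extra prompt "Please be critical" is placed before the question during the testing phase. A minimal sketch of that step follows. The function name `build_question`, the `testing` flag, and the exact punctuation joining the prompt to the question are assumptions made for illustration; the summary does not give the authors' prompt template.

```python
CRITICAL_PROMPT = "Please be critical"


def build_question(question, testing=True):
    """Return the text question that would be tokenized for the LLM.

    Assumption: at test time the extra prompt is simply prepended to the
    question as its own sentence; during training the question is used as-is.
    """
    question = question.strip()
    if testing:
        return f"{CRITICAL_PROMPT}. {question}"
    return question
```

For example, `build_question("What is the man doing?")` yields the question prefixed with the prompt, while `build_question("What is the man doing?", testing=False)` returns the question unchanged.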