Long Video Understanding with Learnable Retrieval in Video-Language Models

Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu

TL;DR

The paper tackles the high computational cost and information loss in long-video understanding with LLMs by introducing R-VLM, a retrieval-based framework that selects a small set of question-relevant video chunks. A lightweight MLP augments the CLIP-based representations to perform end-to-end trainable, question-guided chunk retrieval, with a soft matching loss to align retrieved content with the query. By keeping the encoders and the LLM frozen and training only the MLP and a projector, the approach achieves strong zero-shot performance on multiple long-video QA benchmarks while significantly reducing the number of tokens fed to the LLM. Through ablations and visual analyses, the work shows that selective chunk retrieval preserves visual details, reduces noise, and enables efficient, scalable long-video reasoning with LLMs.

Abstract

The remarkable natural language understanding, reasoning, and generation capabilities of large language models (LLMs) have made them attractive for application to video understanding, utilizing video tokens as contextual input. However, employing LLMs for long video understanding presents significant challenges. The extensive number of video tokens leads to considerable computational costs for LLMs while using aggregated tokens results in loss of vision details. Moreover, the presence of abundant question-irrelevant tokens introduces noise to the video reasoning process. To address these issues, we introduce a simple yet effective learnable retrieval-based video-language model (R-VLM) for efficient long video understanding. Specifically, given a question (query) and a long video, our model identifies and selects the most relevant K video chunks and uses their associated visual tokens to serve as context for the LLM inference. This effectively reduces the number of video tokens, eliminates noise interference, and enhances system performance. We achieve this by incorporating a learnable lightweight MLP block to facilitate the efficient retrieval of question-relevant chunks, through the end-to-end training of our video-language model with a proposed soft matching loss. Our experimental results on multiple zero-shot video question answering datasets validate the effectiveness of our framework for comprehending long videos.
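
The retrieval step described in the abstract (score each video chunk against the question, keep the top $K$) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `cosine` and `retrieve_top_k` are hypothetical, cosine similarity is an assumed scoring function, the learnable MLP and the soft matching loss are omitted, and returning the selected chunks in temporal order is also an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve_top_k(question_emb, chunk_embs, k):
    """Return the indices of the k chunks most similar to the question.

    question_emb: question embedding (assumed to already be in the
                  same space as the chunk embeddings).
    chunk_embs:   one embedding per video chunk.
    Indices are returned in ascending (temporal) order; this ordering
    is an assumption of the sketch.
    """
    scores = [cosine(question_emb, c) for c in chunk_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy example: four chunks with 2-dimensional embeddings.
question = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0], [-1.0, 0.0]]
selected = retrieve_top_k(question, chunks, k=2)
print(selected)  # [1, 2]
```

Only the visual tokens of the selected chunks would then be passed to the LLM, so the context length depends on $K$ rather than on the length of the video.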

Paper Structure

This paper contains 27 sections, 5 equations, 10 figures, 16 tables.

Figures (10)

  • Figure 1: Illustration of our learnable retrieval-based video-language model for efficient long video question answering. We encode an input long video into a sequence of video chunks, with each chunk represented by a set of spatial and temporal visual tokens. Question-guided retrieval is performed to find the top $K$ relevant video chunks, whose tokens serve as the input to the LLM for answer generation. Here, we use $K=1$ for illustration purposes. A learnable lightweight MLP block (following the text encoder) and the projector are trained end-to-end, while the encoders and the LLM are frozen. A soft matching (SM) loss is introduced to regularize the retrieval-related learning.
  • Figure 2: Illustration of the spatial and temporal pooling to obtain 68 visual tokens for a chunk. We perform spatial average pooling with stride 2 to have $M \times \bar{h} \times \bar{w} = 4 \times 8 \times 8 = 256$ tokens per chunk, where $\bar{h} = h/2$ and $\bar{w} = w/2$. This is equivalent to taking the CLIP features of reduced resolution as the extracted feature. Global spatial average pooling for each frame is performed and thus we obtain $M$ tokens for the $M$ frames ($M=4$). Temporal pooling for each spatial position is performed to have $N = \bar{h} \times \bar{w} = 8 \times 8 = 64$ tokens. Therefore, we have $N + M = 64 + 4 = 68$ tokens for a chunk.
  • Figure 3: Visualization of video QA examples from QAEgo4D. The kitchen towel related to the question does not appear in the uniformly sampled video chunks. The second chunk selected by our model contains the kitchen towel. Our answer states that the towel is hanging on a hook on the wall. Video-LLaMA answers incorrectly: the towel does not appear in the first frame of the video, and it is not placed on the countertop in front of a cutting board. Note that we use a sampled frame in a chunk to illustrate the chunk in the first two rows. GPT-4o correctly points out where the towel is.
  • Figure 5: Visualization of a video QA example from QAEgo4D. We can see that the gray car does not appear in the uniformly sampled video chunks. Our R-VLM correctly answers that the car was parked in the parking lot (outdoors), whereas R-VLM w/Uni.’s answer was the garage (indoors). Video-LLaMA does not answer where the car is, and the ground-truth frames do not appear in frame 100. Video-ChatGPT makes the same mistake as R-VLM w/Uni. GPT-4o hallucinates that the car was parked on a gravel road, whereas the road was flat, so its answer is inaccurate.
  • Figure 6: Visualization of video QA examples from WildQA. In this video, two clips show vegetation and the remaining clips show mountains, rivers, etc. Uniform sampling mainly obtains segments such as mountains and rivers rather than segments with vegetation. Therefore, the answer based on uniform sampling describes only the terrain, without giving vegetation types. In contrast, our retrieved chunks contain video clips of vegetation. Thus the types of vegetation are predicted correctly: trees, bushes, forest. Video-ChatGPT gives a global description and does not answer specific vegetation types. GPT-4o gives the correct answer.
  • ...and 5 more figures
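
The token arithmetic in the Figure 2 caption can be checked with a small pure-Python sketch. It is an illustration under stated assumptions rather than the authors' code: the function names (`spatial_pool`, `mean_vectors`, `chunk_tokens`) are hypothetical, the feature dimension of 2 is a toy value, the grid size $h = w = 16$ is inferred from $\bar{h} = h/2 = 8$, and the order in which the two token groups are concatenated is assumed.

```python
def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    d = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]

def spatial_pool(frame, stride=2):
    """Average-pool an h x w grid of feature vectors with a stride x stride window.

    Assumes h and w are divisible by stride.
    """
    h, w = len(frame), len(frame[0])
    return [
        [
            mean_vectors([frame[i + a][j + b]
                          for a in range(stride) for b in range(stride)])
            for j in range(0, w, stride)
        ]
        for i in range(0, h, stride)
    ]

def chunk_tokens(chunk, stride=2):
    """Turn a chunk of M frames (each an h x w grid) into N + M tokens."""
    pooled = [spatial_pool(f, stride) for f in chunk]  # M x h/2 x w/2 grids
    hb, wb = len(pooled[0]), len(pooled[0][0])
    # Temporal pooling at each spatial position: N = hb * wb tokens.
    spatial_tokens = [mean_vectors([f[i][j] for f in pooled])
                      for i in range(hb) for j in range(wb)]
    # Global spatial pooling within each frame: M tokens.
    frame_tokens = [mean_vectors([f[i][j] for i in range(hb) for j in range(wb)])
                    for f in pooled]
    return spatial_tokens + frame_tokens

# M = 4 frames, 16 x 16 grid, toy 2-dimensional features.
chunk = [[[[float(m), 1.0] for _ in range(16)] for _ in range(16)]
         for m in range(4)]
tokens = chunk_tokens(chunk)
print(len(tokens))  # 68
```

Without the temporal and global pooling, the same chunk would contribute $4 \times 8 \times 8 = 256$ tokens, so the pooled representation is roughly a quarter of that size.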