Text-Conditioned Resampler For Long Form Video Understanding

Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari

TL;DR

The paper tackles the bottleneck of applying large visual-language models to long-form videos by introducing the Text-Conditioned Resampler (TCR), a lightweight transformer-based adapter that selects task-relevant visual features conditioned on text and passes a fixed-length embedding to a frozen LLM. TCR is trained in three stages (initialization, LLM-aligned pre-training, and task-specific fine-tuning) while the visual encoder and LLM stay frozen, which makes processing 100+ frames efficient. Empirically, TCR improves performance on NextQA, EgoSchema, and the Ego4D challenges (including LTA and MQ), achieving new state-of-the-art results on several long-video tasks and providing insight into how temporal span and frame density affect downstream reasoning. Overall, TCR offers a scalable, resource-efficient path to long-duration video reasoning with VLMs, with applications in video QA and egocentric understanding.

Abstract

In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate its efficacy on a wide variety of evaluation tasks including NextQA, EgoSchema, and the EGO4D-LTA challenge.
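The core idea in the abstract, a fixed number of queries cross-attending over an arbitrarily long sequence of visual features, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single attention head with no learned projections, and it conditions on text by adding the mean text embedding to each query, which is a simplifying assumption made here for brevity. The names `resample`, `random_matrix`, `DIM`, and `NUM_QUERIES` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def resample(visual, queries, text):
    """Cross-attend text-conditioned queries over visual features.

    visual:  N x D list of visual feature vectors (N may be large)
    queries: K x D list of query vectors (K is fixed)
    text:    T x D list of text-condition embeddings
    Returns (K x D outputs, K x N attention weights).
    """
    dim = len(queries[0])
    # Assumption for this sketch only: condition each query by adding
    # the mean text embedding. The paper's module is a transformer.
    text_mean = [sum(tok[d] for tok in text) / len(text) for d in range(dim)]
    conditioned = [[q[d] + text_mean[d] for d in range(dim)] for q in queries]

    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in conditioned:
        # One softmax over all N visual features per query.
        w = softmax([dot(q, v) * scale for v in visual])
        weights.append(w)
        outputs.append(
            [sum(w[n] * visual[n][d] for n in range(len(visual))) for d in range(dim)]
        )
    return outputs, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4
queries = random_matrix(NUM_QUERIES, DIM, rng)
text = random_matrix(3, DIM, rng)
short_video = random_matrix(16, DIM, rng)   # e.g. 16 visual features
long_video = random_matrix(120, DIM, rng)   # e.g. 120 visual features

out_short, _ = resample(short_video, queries, text)
out_long, w_long = resample(long_video, queries, text)
print(len(out_short), len(out_long))  # both equal NUM_QUERIES
```

The output length depends only on the number of queries, so the downstream LLM sees the same number of embeddings whether the video contributes 16 or 120 features. The cost of the attention step grows linearly with the number of visual features rather than quadratically, which is consistent with the abstract's claim that plain attention suffices for more than 100 frames.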

Paper Structure (24 sections, 5 figures, 11 tables)

Figures (5)

  • Figure 1: TCR resamples visual features that are relevant for the downstream tasks before passing them to the LLM. A qualitative example can be seen on the right.
  • Figure 2: Left: overview of how TCR integrates into a VLM in order to process long videos. A long (30-120 frames) sequence from a visual encoder (V) is resampled to a fixed-length sequence fed to a language model. [CPN] indicates a special token for captioning; [7][9] is a representation of tokenised time steps. Right: details of the TCR module. Elements in blue are kept frozen. Best viewed in colour.
  • Figure 3: Performance vs. number of frames used by the models on various tasks. $t$ denotes the average length of the videos in the dataset. 'tcr'=Ours, 'iv'=InternVideo wang2022internvideo, 'TF'=TimeSformer bertasius2021space, 'b2'=BLIP2 li_blip-2_2023, 'sevilla'=SeViLA yu2023sevilla, RepNet=dwibedi2020counting.
  • Figure 4: Examples of our model responding to various textual prompts taken from NextQA, EGO4D-MR, and YTT datasets. The opacity of the images in the second row is correlated to the mean patch attention score for that frame. Note that frames are subsampled and the TCR conditioning is not included for clarity.
  • Figure 5: Pre-training task examples with a condition sequence (TCR input) and the context sequence (LLM input). [CPN] (captioning) and [TRG] (temporal grounding) are special tokens, and [6] is a sample of a tokenised timestamp. We use the [MASK] (masking) token to form LLM prompts where applicable, as it is an integral part of T5's training raffel2020exploring. The model is tasked with predicting the caption "You can see my Nikon camera is in here", which occurs between the 6th and 8th second of a video.
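
The Figure 4 caption says that frame opacity is correlated with the mean patch attention score for that frame. A minimal sketch of one way to produce such a visualisation follows. The caption does not specify the mapping, so the min-max normalisation and the `min_opacity` floor are assumptions, and `frame_opacity` is a hypothetical helper name.

```python
def frame_opacity(attn, min_opacity=0.2):
    """Map per-patch attention scores to one opacity value per frame.

    attn: list of frames, each a non-empty list of per-patch attention scores.
    Returns a list of opacities in [min_opacity, 1.0].
    """
    # Mean patch attention score for each frame.
    means = [sum(patches) / len(patches) for patches in attn]
    lo, hi = min(means), max(means)
    if hi == lo:
        # All frames attended equally: render them fully opaque.
        return [1.0 for _ in means]
    # Assumption: linear min-max rescaling with a visibility floor.
    return [min_opacity + (1.0 - min_opacity) * (m - lo) / (hi - lo) for m in means]


# Three frames with two patches each; the middle frame gets the most attention.
example_attn = [[0.1, 0.3], [0.4, 0.6], [0.2, 0.2]]
opacities = frame_opacity(example_attn)
```

With this mapping the most-attended frame is drawn fully opaque and the least-attended frame is drawn at the floor value, so no frame disappears entirely from the figure.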