Text-Conditioned Resampler For Long Form Video Understanding
Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari
TL;DR
The paper tackles the bottleneck of applying large visual-language models (VLMs) to long-form video by introducing the Text-Conditioned Resampler (TCR), a lightweight transformer-based adapter that selects task-relevant visual features conditioned on text and passes a fixed-length embedding to a frozen LLM. TCR is trained in three stages (initialization, LLM-aligned pre-training, and task-specific fine-tuning) while the visual encoder and LLM stay frozen, which allows it to process 100+ frames efficiently. Empirically, TCR improves performance on NextQA, EgoSchema, and the Ego4D challenges (including LTA and MQ), sets new state-of-the-art results on several long-video tasks, and offers insight into how temporal span and frame density affect downstream reasoning. Overall, TCR offers a scalable, resource-efficient path to long-duration video reasoning with VLMs, with applications in video QA and egocentric understanding.
Abstract
In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time with plain attention and without optimised implementations. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we identify tasks that could benefit from longer video perception; and (iii) we empirically validate the efficacy of TCR on a wide variety of evaluation tasks including NextQA, EgoSchema, and the EGO4D-LTA challenge.
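To make the resampling idea concrete, the following is a minimal sketch of text-conditioned cross-attention, not the authors' implementation. It assumes a single attention head, identity projections instead of learned weight matrices, no residual connections, and one self-attention step that mixes the text condition into a set of latent queries before they attend over the visual features. The names `attention`, `resample`, and `random_matrix`, and all sizes, are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: identity projections, so keys and values are the same vectors.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return outputs


def resample(visual_feats, text_tokens, latent_queries):
    """Return one output vector per latent query, whatever the video length."""
    queries = latent_queries + text_tokens
    # Self-attention lets the latent queries absorb the text condition.
    mixed = attention(queries, queries)
    # Cross-attention reads the visual features; the score matrix has
    # len(queries) x len(visual_feats) entries, so it grows linearly with frames.
    attended = attention(mixed, visual_feats)
    # Keep only the latent-query positions as the fixed-length output.
    return attended[: len(latent_queries)]


def random_matrix(rng, rows, dim):
    """Hypothetical stand-in for real encoder outputs or learned parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
dim, num_latents = 8, 4
latents = random_matrix(rng, num_latents, dim)
text = random_matrix(rng, 3, dim)
short_video = random_matrix(rng, 16, dim)   # 16 visual tokens
long_video = random_matrix(rng, 128, dim)   # 128 visual tokens

out_short = resample(short_video, text, latents)
out_long = resample(long_video, text, latents)
print(len(out_short), len(out_long))  # prints "4 4"
```

The output length depends only on the number of latent queries, which is why a frozen LLM can receive a fixed-length embedding for any number of frames. Because the queries are few and fixed in number, the attention cost grows linearly with the number of visual tokens rather than quadratically, which is consistent with the claim that plain attention suffices for more than 100 frames.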
