Slot-VLM: SlowFast Slots for Video-Language Modeling
Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu
TL;DR
Slot-VLM tackles the challenge of aligning dense video features with large language models by converting video tokens into a compact set of semantically decoupled tokens. It introduces a SlowFast Slots (SF-Slots) module with two branches: Slow-Slots (object-centric, high spatial resolution at a low frame rate) and Fast-Slots (event-centric, high frame rate at reduced spatial resolution). The slots from both branches are projected into the LLM's input space. The approach uses a frozen CLIP encoder and a frozen Vicuna-7B LLM with a three-stage training pipeline (slot pre-training, single-branch instruction tuning, and two-branch joint tuning), and achieves state-of-the-art results on three video QA benchmarks with about 100K instruction-tuning examples. This semantic-token strategy improves efficiency and interpretability in video-language reasoning, offering a scalable path toward more robust video understanding in LLM-based systems.
Abstract
Video-Language Models (VLMs), powered by advances in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is developing an efficient method to encapsulate video content into a set of representative tokens that align with LLMs. In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. Specifically, we design a SlowFast Slots module, i.e., SF-Slots, that adaptively aggregates the dense video tokens from the CLIP vision encoder into a set of representative slots. To account for both spatial object details and varied temporal dynamics, SF-Slots is built with a dual-branch structure. The Slow-Slots branch extracts object-centric slots from features at high spatial resolution but a low (slow) frame sample rate, emphasizing detailed object information. Conversely, the Fast-Slots branch is engineered to learn event-centric slots from features at a high temporal sample rate but low spatial resolution. These complementary slots are combined to form the vision context, which serves as the input to the LLM for efficient question answering. Our experimental results demonstrate the effectiveness of Slot-VLM, which achieves state-of-the-art performance on video question answering.
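To make the dual-branch idea concrete, the following is a minimal NumPy sketch of how dense video tokens might be reduced to a small set of slots. It is an illustration under stated assumptions, not the authors' implementation: the names `slot_attention` and `sf_slots`, the stride, the pooling factor, and the slot counts are all hypothetical, and the slot-attention routine is a simplified variant that omits the learned projections, GRU, and MLP of full Slot Attention. It also assumes that the slow branch forms slots per sampled frame, and that the fast branch forms slots along time for each pooled spatial location, which is one plausible reading of the abstract.

```python
import numpy as np

def slot_attention(tokens, num_slots, iters=3, seed=0):
    """Simplified slot attention: tokens (N, D) -> slots (num_slots, D).
    Assumption: no learned projections, GRU, or MLP (illustration only)."""
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        logits = tokens @ slots.T / np.sqrt(d)            # (N, K)
        logits -= logits.max(axis=1, keepdims=True)       # numerical stability
        attn = np.exp(logits)
        attn /= attn.sum(axis=1, keepdims=True)           # softmax over slots: slots compete for tokens
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)  # normalize over tokens per slot
        slots = weights.T @ tokens                        # weighted mean of tokens, (K, D)
    return slots

def sf_slots(video, slow_stride=4, pool=4, slots_per_group=8):
    """video: dense features of shape (T, H, W, D). All hyperparameters are hypothetical."""
    t, h, w, d = video.shape
    # Slow branch: low frame rate, full spatial resolution, slots per sampled frame.
    slow = np.concatenate([
        slot_attention(frame.reshape(h * w, d), slots_per_group)
        for frame in video[::slow_stride]
    ])
    # Fast branch: every frame, spatially average-pooled, slots along time per location.
    hp, wp = h // pool, w // pool
    pooled = video.reshape(t, hp, pool, wp, pool, d).mean(axis=(2, 4))  # (T, hp, wp, D)
    tubes = pooled.reshape(t, hp * wp, d).transpose(1, 0, 2)            # (hp*wp, T, D)
    fast = np.concatenate([slot_attention(tube, slots_per_group) for tube in tubes])
    # Combined slots form the vision context that would be projected to the LLM.
    return np.concatenate([slow, fast])

video = np.random.default_rng(1).normal(size=(16, 8, 8, 32))  # 16 frames, 8x8 tokens each
context = sf_slots(video)
print(video.shape[0] * 8 * 8, "dense tokens ->", context.shape[0], "slots")
```

With these toy settings, 1024 dense tokens are reduced to 64 slots (32 from each branch). In the actual model the slots would additionally pass through a learned projection before reaching the LLM, which this sketch leaves out.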
