Slot-VLM: SlowFast Slots for Video-Language Modeling

Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu

TL;DR

Slot-VLM tackles the challenge of aligning dense video features with large language models by converting video tokens into a compact set of semantically decoupled tokens. It introduces a SlowFast Slots (SF-Slots) module with two branches: Slow-Slots (object-centric, high spatial resolution at low frame rate) and Fast-Slots (event-centric, high frame rate with reduced spatial resolution), whose slots are projected to an LLM input. The approach uses a frozen CLIP encoder and a frozen Vicuna-7B LLM, with a three-stage training pipeline (slot pre-training, single-branch instruction tuning, and two-branch joint tuning) and achieves state-of-the-art results on three video QA benchmarks with about 100K instruction-tuning examples. This semantic-token strategy improves efficiency and interpretability in video-language reasoning, offering a scalable path toward more robust video understanding in LLM-based systems.

Abstract

Video-Language Models (VLMs), powered by advances in Large Language Models (LLMs), are charting new frontiers in video understanding. A pivotal challenge is developing an efficient method to encapsulate video content into a set of representative tokens that align with LLMs. In this work, we introduce Slot-VLM, a novel framework designed to generate semantically decomposed video tokens, in terms of object-wise and event-wise visual representations, to facilitate LLM inference. In particular, we design a SlowFast Slots module, i.e., SF-Slots, that adaptively aggregates the dense video tokens from the CLIP vision encoder into a set of representative slots. To account for both spatial object details and varied temporal dynamics, SF-Slots is built with a dual-branch structure. The Slow-Slots branch extracts object-centric slots from features at high spatial resolution but low (slow) frame sample rate, emphasizing detailed object information. Conversely, the Fast-Slots branch is engineered to learn event-centric slots from features at high temporal sample rate but low spatial resolution. These complementary slots are combined to form the vision context, which serves as the input to the LLM for efficient question answering. Our experimental results demonstrate the effectiveness of Slot-VLM, which achieves state-of-the-art performance on video question answering.
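
The dual-branch aggregation described in the abstract can be illustrated with a small shape-level sketch. This is not the authors' implementation: it uses a stripped-down slot attention (no learned projections, GRU, MLP, or layer normalization), and the uniform frame sampling, the spatial average pooling, the toy feature sizes, and every function name (`slot_attention`, `slow_slots`, `fast_slots`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(tokens, num_slots, iters=3, seed=0):
    """Simplified slot attention: tokens (n, d) -> slots (num_slots, d).

    Assumption: learned projections and the GRU/MLP update are omitted,
    so only the competition-then-aggregate step is shown.
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        logits = slots @ tokens.T / np.sqrt(d)      # (num_slots, n)
        attn = softmax(logits, axis=0)              # slots compete for each token
        weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
        slots = weights @ tokens                    # weighted mean of tokens per slot
    return slots, attn

def slow_slots(feats, t_d, n_s):
    """Slow branch: few frames, full spatial grid, slots over space per frame."""
    T, H, W, D = feats.shape
    idx = np.linspace(0, T - 1, t_d).astype(int)    # assumed uniform frame sampling
    return np.stack([slot_attention(feats[i].reshape(H * W, D), n_s)[0] for i in idx])

def fast_slots(feats, h_d, w_d, n_f):
    """Fast branch: all frames, pooled spatial grid, slots over time per position."""
    T, H, W, D = feats.shape
    # Assumed average pooling down to an h_d x w_d grid (H, W divisible by h_d, w_d).
    pooled = feats.reshape(T, h_d, H // h_d, w_d, W // w_d, D).mean(axis=(2, 4))
    per_pos = pooled.reshape(T, h_d * w_d, D).transpose(1, 0, 2)   # (M_d, T, D)
    return np.stack([slot_attention(seq, n_f)[0] for seq in per_pos])

# Toy stand-in for frozen CLIP features: T=32 frames, 16x16 grid, D=32 channels.
feats = np.random.default_rng(1).normal(size=(32, 16, 16, 32))
slow = slow_slots(feats, t_d=8, n_s=8)              # (8, 8, 32)
fast = fast_slots(feats, h_d=4, w_d=4, n_f=8)       # (16, 8, 32)
# Vision context: all slots from both branches, before the projection layer.
context = np.concatenate([slow.reshape(-1, 32), fast.reshape(-1, 32)], axis=0)
print(slow.shape, fast.shape, context.shape)
```

The softmax is taken over the slot axis, so slots compete for each input token; this competition is what encourages each slot to bind to a distinct object or event rather than mixing semantics.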

Paper Structure (23 sections, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Illustration of methods for aligning visual features with an LLM. Previous methods (a) and (b) use pooling or a Q-Former to aggregate visual tokens, so each generated token contains coupled semantics. In contrast, we propose to generate semantically decoupled object-centric tokens, as illustrated in (c), and event-centric tokens, as illustrated in (d), to align with the LLM.
  • Figure 2: Flowchart of our proposed Slot-VLM for video understanding. Slot-VLM consists of a frozen image encoder, a learnable SlowFast Slots module (i.e., SF-Slots module), a projection layer, and a frozen LLM. The image encoder (CLIP image encoder) encodes the input video of $T$ frames into a sequence of image features, resulting in a large number ($H\times W \times T$) of video tokens. To obtain semantically decoupled and compact (reduced) video tokens as the vision context for aligning with the LLM, our SlowFast Slots module learns to aggregate those tokens into object-centric tokens and event-centric tokens through the Slow-Slots branch and the Fast-Slots branch, respectively. The Slow-Slots branch operates at a low frame rate ($t^d \ll T$) but high spatial resolution in order to capture spatial objects through slot attention on each frame. The Fast-Slots branch operates at a high frame rate but low spatial resolution ($M^d= h^d \times w^d, h^d < H$, $w^d < W$) in order to capture temporal dynamics through slot attention over each spatial position. The learned slots (tokens) from the two branches are projected by a fully connected layer and fed to the LLM for video reasoning, together with the text input (text query).
  • Figure 3: Visualization of spatial attention masks from the Slow-Slots branch for two video examples. We have $t^d=8$ frames, shown in 8 rows indexed by $i$, where $i=1,\ldots,t^d$. The first column shows the original frame. The second to the ninth columns show the cross-attention masks (from slot attention) for the $N_s=8$ object-centric slots $\mathcal{O}_i = \{\mathbf{o}_{i,1}, \ldots, \mathbf{o}_{i,N_s}\}$. We can see that even though the segmentation is not perfect, some meaningful slots have been formed. For example, the slots marked by red, purple, green, and blue in the first video (left) correspond to "background", "human body", "head", and "barbell". Note that the slots in a frame are unordered and exchangeable.
  • Figure 4: Visualization of temporal attention masks for $M^d = h^d \times w^d = 16$ spatial positions from (a) our Fast-Slots branch and (b) Fast-QFormer-VLM, respectively. For simplicity, we also refer to a slot as a query here. For the $k^{th}$ spatial position, we denote the set of learned temporal queries by $\mathcal{E}_k$. Take the query set $\mathcal{E}_{13}$ of the $13^{th}$ spatial position as an example (marked by the red box in (a) and the blue box in (b)). For this spatial position, the models generate $N_f=8$ slots/queries by aggregating the temporal visual tokens. The attention masks for $\mathcal{E}_{13}$ are shown as a map of $T$ rows and $N_f$ columns, with the visibility indicating which queries each temporal position belongs to. The higher the visibility, the greater the affinity between the temporal position and the query. We can see that in our Slot-VLM, similar contents tend to be allocated to the same slot, i.e., different slots capture different contents (events) and present decoupled semantics. In contrast, in Fast-QFormer-VLM, different contents are usually assigned to the same query or are uniformly assigned to different queries. Note that for Fast-QFormer-VLM, we show the mask of only one head to save space; similar observations hold for the other heads. A glimpse of the original video and an enlarged visualization of $\mathcal{E}_{13}$ can be found in the Appendix.
  • Figure 5: Visualization of spatial attention masks from the Q-Former in BLIP2 for two images, in (a) and (b) respectively. We show the learned query masks for the 12 heads in 12 rows. In each row, we show the masks for the 32 queries. Note that the first column shows the original image, repeated 12 times. There is no obvious evidence that different queries have learned decoupled semantics.
  • ...and 5 more figures
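
The slot counts quoted in the captions of Figures 3 and 4 ($t^d=8$, $N_s=8$, $M^d=16$, $N_f=8$) imply a small vision context. The sketch below works through that arithmetic, assuming all slots from both branches are passed to the LLM without further pooling. The dense grid size used for comparison ($H=W=16$, $T=64$) is a hypothetical example and does not come from the captions; the function names are likewise illustrative.

```python
def slot_token_budget(t_d, n_s, m_d, n_f):
    """Return (slow, fast, total) token counts for the two branches."""
    slow = t_d * n_s   # N_s object-centric slots for each of t_d sampled frames
    fast = m_d * n_f   # N_f event-centric slots for each of M_d spatial positions
    return slow, fast, slow + fast

def dense_token_count(h, w, t):
    """Number of dense video tokens before aggregation (H x W x T)."""
    return h * w * t

# Values taken from the captions of Figures 3 and 4.
slow, fast, total = slot_token_budget(t_d=8, n_s=8, m_d=16, n_f=8)
print(slow, fast, total)  # 64 128 192

# Hypothetical dense grid, for comparison only (not stated in the captions).
dense = dense_token_count(h=16, w=16, t=64)
print(dense, round(dense / total, 1))
```

Under these assumptions the LLM sees 192 visual tokens regardless of how the $H\times W \times T$ dense grid grows, as long as the branch hyperparameters stay fixed.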