TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations

Mingze Gao; Jingyu Liu; Mingda Li; Jiangtao Xie; Qingbin Liu; Bo Zhao; Xi Chen; Hui Xiong

TL;DR

This paper proposes two strategies that enhance video understanding by improving inter-layer attention computation in LLMs, and uses them to adapt LLaVA for video understanding tasks. The resulting model is named Temporal-Considered LLaVA (TC-LLaVA).

Abstract

Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
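
The idea behind adding temporal position information on top of ordinary token positions can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `dual_position_ids`, the additive combination rule, the choice to let text tokens continue the temporal count after the last frame, and the `gamma` scaling argument are all assumptions made for illustration. The sketch only shows how a per-frame temporal index could be mixed into global position IDs without changing the order of tokens.

```python
import numpy as np

def dual_position_ids(num_frames, tokens_per_frame, num_text, gamma=1.0):
    """Hypothetical helper: combine global and temporal position IDs.

    Assumed layout: all visual tokens (frame by frame) followed by text tokens.
    """
    n_vis = num_frames * tokens_per_frame
    n = n_vis + num_text

    # Global IDs: the usual 0..n-1 positions that standard RoPE would use.
    global_ids = np.arange(n, dtype=float)

    # Temporal IDs (assumption): visual tokens share their frame index,
    # text tokens keep counting upward after the last frame.
    temporal_vis = np.arange(n_vis) // tokens_per_frame
    temporal_txt = num_frames + np.arange(num_text)
    temporal_ids = np.concatenate([temporal_vis, temporal_txt]).astype(float)

    # Assumed combination: global position plus a scaled temporal offset.
    return global_ids + gamma * temporal_ids

if __name__ == "__main__":
    # 2 frames of 2 visual tokens each, then 2 text tokens.
    print(dual_position_ids(2, 2, 2, gamma=0.5))
```

Because both components are non-decreasing along the sequence, the combined IDs stay strictly increasing in this sketch, so the relative ordering of visual and text tokens is unchanged while tokens in later frames are pushed further apart.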

Paper Structure (18 sections, 13 equations, 5 figures, 5 tables)

Introduction
Related Work
Attention in Vision and Language Models
Video Multimodal Large Language Models
Method
Preliminary: Introducing Position Embeddings
Temporal-Aware Dual RoPE
Frame-wise Block Causal Attention Mask
Experiments
Experimental Setup
Comparison with SOTA
Ablation Studies
Time-Aware RoPE Ablation
Attention Mask and Combination Ablation
Other Time-Aware Position Embedding
...and 3 more sections

Figures (5)

Figure 1: Video-language processing with LLaMA (Touvron et al., 2023) and our TC-LLaVA. Arrows represent a token's attention interactions, and numbers indicate the relative positional distance between tokens. Vanilla Attention uniformly encodes and applies attention interactions to both visual and text tokens. The proposed TC Attention incorporates temporal information encoding and differentiates interactions between visual tokens within and across frames, which are indicated by different colors.
Figure 2: The framework of TC-LLaVA. During the SFT stage, the projector and the language model (LLM) are unfrozen, while the visual encoder remains frozen. The right part illustrates our TC-attention mechanism in each Transformer layer. After applying Temporal-Aware Dual RoPE, both visual and text tokens acquire additional temporal positional information while preserving the global relative positional relationships. The Frame-wise Block Causal Mask aims to enhance visual token interactions within and across frames.
Figure 3: Variations of Attention Masks. To explore attention mechanisms for better interactions, we compare Causal Mask (1.) with three variants: Full Visual Mask (2.), Frame-wise (Fw) Block Mask (3.), and Frame-wise (Fw) Block Causal Mask (4.). Red indicates pure visual token interactions, blue represents pure text token interactions, and purple denotes interactions between visual and text tokens.
Figure 4: Different settings of the ratio $\gamma$ for Time-Aware RoPE on MVBench. The red dashed line represents the baseline, i.e. performance without Time-Aware RoPE. The blue line shows how model performance varies across ratio settings.
Figure 5: Comparison of attention weights with corresponding attention mask for Vanilla attention (a) and TC attention (b). Lighter colors represent higher weights.
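
The mask captions above can be made concrete with a short NumPy sketch of one plausible reading of a frame-wise block causal mask. It assumes that visual tokens may attend to every token in their own frame and in earlier frames, while text tokens remain strictly causal; this interpretation, the token layout (visual tokens first, then text), and the function name `frame_block_causal_mask` are assumptions for illustration rather than the paper's exact definition.

```python
import numpy as np

def frame_block_causal_mask(num_frames, tokens_per_frame, num_text):
    """Hypothetical helper: boolean mask where True means "query may attend to key".

    Rows are queries, columns are keys.
    Assumed layout: visual tokens (frame by frame) followed by text tokens.
    """
    n_vis = num_frames * tokens_per_frame
    n = n_vis + num_text

    # Start from the ordinary causal (lower-triangular) mask.
    mask = np.tril(np.ones((n, n), dtype=bool))

    # Frame index of every visual token.
    frame_id = np.arange(n_vis) // tokens_per_frame

    # Assumption: inside the visual block, a query sees a key whenever the
    # key's frame is not later than the query's frame. This opens up full
    # attention within a frame and keeps causality across frames.
    mask[:n_vis, :n_vis] = frame_id[:, None] >= frame_id[None, :]
    return mask

if __name__ == "__main__":
    # 2 frames of 2 visual tokens each, then 2 text tokens.
    print(frame_block_causal_mask(2, 2, 2).astype(int))
```

Compared with a plain causal mask, only the visual-to-visual block changes in this sketch: entries above the diagonal are enabled within each frame, and everything involving text tokens is left as in the causal case.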
