SEAL: Semantic Attention Learning for Long Video Representation

Lan Wang, Yujia Chen, Du Tran, Vishnu Naresh Boddeti, Wen-Sheng Chu

TL;DR

SEAL addresses the core challenges of long-video understanding (computational burden, temporal redundancy, and cross-task generalization) by decomposing videos into three semantic token types (scene, object, action) and applying a query-guided attention learning mechanism framed as fixed-size subset selection. The approach supports both global and streaming processing, enabling efficient handling of arbitrarily long videos. Experimental results on LVBench, MovieChat-1K, and Ego4D-NLQ show state-of-the-art performance across video QA and temporal grounding tasks, with strong generalization from a relatively compact model. The combination of semantic decomposition, attention that optimizes jointly for relevance and diversity, and flexible prediction heads makes the approach practical for scalable long-video understanding.

Abstract

Long video understanding presents challenges due to the inherent high computational complexity and redundant temporal information. An effective representation for long videos must efficiently process such redundancy while preserving essential contents for downstream tasks. This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a compact set of entities rather than a large number of frames or pixels. To further address redundancy, we propose an attention learning module that balances token relevance with diversity, formulated as a subset selection optimization problem. Our representation is versatile and applicable across various long video understanding tasks. Extensive experiments demonstrate that SEAL significantly outperforms state-of-the-art methods in video question answering and temporal grounding tasks across diverse benchmarks, including LVBench, MovieChat-1K, and Ego4D.
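The abstract describes attention learning as a subset selection problem that balances token relevance with diversity. The sketch below illustrates that general idea with a greedy, maximal-marginal-relevance-style heuristic. It is not the authors' optimization: the function `select_tokens`, the use of cosine similarity, and the way the weight `alpha` combines the two terms are all assumptions made for illustration, and token vectors are assumed to be non-zero.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_tokens(tokens, query, k, alpha=0.9):
    """Greedily pick up to k token indices (hypothetical helper).

    Each step picks the token with the highest score
        alpha * relevance - (1 - alpha) * redundancy
    where relevance is the similarity to the query and redundancy is the
    highest similarity to any token already selected. This scoring rule is
    an assumption, not the objective used in the paper.
    """
    relevance = [_cosine(t, query) for t in tokens]
    selected = []
    remaining = list(range(len(tokens)))

    def score(i):
        redundancy = max(
            (_cosine(tokens[i], tokens[j]) for j in selected), default=0.0
        )
        return alpha * relevance[i] - (1 - alpha) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)  # first index wins on ties
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example: tokens 0 and 1 are duplicates, token 2 is different.
toy_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_query = [1.0, 0.2]
print(select_tokens(toy_tokens, toy_query, k=2, alpha=0.5))  # [0, 2]
print(select_tokens(toy_tokens, toy_query, k=2, alpha=1.0))  # [0, 1]
```

With `alpha=1.0` only relevance counts, so the duplicate token is selected; lowering `alpha` penalizes the duplicate and the second slot goes to the dissimilar token instead.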

Paper Structure

This paper contains 18 sections, 4 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Long Video Representation with Semantic Attention Learning (SEAL): (a) Conventional uniform sampling results in redundant and cluttered visual information, making it difficult for both AI models and human brains to process efficiently. (b) Decomposing long videos into semantic entities such as scenes, objects, and actions reduces temporal redundancy, thus making model training and inference more efficient. In this example, the long video $\mathcal{V}$ is decomposed into 4 scene tokens (S1--S4), 6 object tokens (O1--O6), and 4 action/event tokens (A1--A4). (c) Query-aware attention learning module improves downstream task performance by focusing on relevant information rather than processing everything. Queries (Q1--Q4) are shown with their most relevant tokens. (best viewed in color)
  • Figure 2: SEAL Overview. During semantic decomposition, a long video $\mathcal{V}$ is decomposed into semantic tokens representing scenes, objects, and actions. Then, during attention learning, a subset of these tokens is selected, given the query $q$, by optimizing for query relevance $R(\cdot)$ and token diversity $S(\cdot)$. The resulting attended token subset is then passed to a vision head or an MLLM head for predictions.
  • Figure 3: Qualitative results on LVBench. Two long videos are visualized with questions, multiple-choice options, and SEAL's predicted answers. SEAL attends to relevant entities such as "royal family" and "stool" (Q1.a), different "meals" and "drinks" (Q2.a), and "scene" and "location" (Q2.b), and answers these questions correctly. Although it attends to the relevant "push-up" activity (Q2.c), SEAL fails to predict the right answer because of the difficulty of the causal reasoning question.
  • Figure 4: Accuracy vs. efficiency trade-off on LVBench. SEAL runs 2-3x faster than InternVL2 at the same accuracy, and is more accurate when compared at the same FPS.
  • Figure 5: Ablation studies of different values of $\alpha$ on LVBench. $\alpha = 0.9$ achieves the best performance across different tasks except for temporal grounding (TG).
  • ...and 2 more figures