
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

Junho Kim, Hyunjun Kim, Hosu Lee, Yong Man Ro

TL;DR

This work tackles the context-length and memory bottlenecks that arise when video-language models process long, untrimmed videos. It introduces SALOVA, a retrieval-driven framework that concentrates reasoning on a small set of relevant video segments via a Spatio-Temporal Connector and a Segment Retrieval Router, then feeds them to the LLM through FocusFast pathways. A key contribution is SceneWalk, a densely captioned long-video dataset that enables segment-level knowledge injection, along with a three-stage training protocol (cross-modality alignment, long-video knowledge injection, and video instruction tuning). Empirically, SALOVA achieves strong long-video understanding on Video-MME and LongVideoBench, remains competitive on general benchmarks, and reduces information loss by targeting the relevant segments. The approach offers scalable, context-efficient long-form video analysis, with practical implications for targeted video QA, retrieval, and reasoning across extended sequences.

Abstract

Despite advances in Large Multi-modal Models, applying them to long and untrimmed video content remains challenging due to limitations in context length and substantial memory overhead. These constraints often lead to significant information loss and reduced relevance in the model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve this: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs integrating a dynamic routing mechanism and a spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates an enhanced ability to process complex long-form videos and to maintain contextual integrity across extended sequences.
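
The query-driven segment retrieval described in the abstract can be illustrated with a minimal sketch. This is not the paper's SR-router: it assumes each segment and the user query have already been embedded as plain vectors, and it scores them with cosine similarity, which is a stand-in chosen for illustration. The names `cosine` and `route_segments` and the parameter `k` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route_segments(
    query_emb: Sequence[float],
    segment_embs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[int]:
    """Pick the k segments most similar to the query (illustrative assumption,
    not the paper's router) and return their indices in temporal order."""
    scores = [cosine(query_emb, seg) for seg in segment_embs]
    # Rank segment indices by descending similarity; ties keep temporal order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore temporal order so downstream processing sees segments as they occur.
    return sorted(ranked[:k])


# Toy example: four segments with hand-written 3-d embeddings.
segments = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.0, 0.0]
selected = route_segments(query, segments, k=2)
print(selected)  # [0, 2]
```

Only the selected segments would then be passed to the language model in full detail, which is how a retrieval step of this kind keeps the context length bounded regardless of video duration.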

Paper Structure

This paper contains 57 sections, 1 equation, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Overview of the SceneWalk dataset: (a) dataset comparison, (b) detailed statistics, and (c) the annotation pipeline for description and score collection. The scale of the circles in Figure 1(a) indicates the data size, and the color distribution in Figure 1(b) denotes the video duration in each video category; brighter colors correspond to shorter video durations. Further details about the dataset are provided in Appendix A.
  • Figure 2: The network overview of SALOVA. Our framework consists of four structural components: vision encoder, ST-connector, SR-router, and LLMs. Using the FocusFast strategy, our model can concentrate on more detailed local information while maintaining context awareness.
  • Figure 3: Comparison results on V-NIAH. The x-axis indicates the total number of video frames, and the y-axis indicates the location of the needle image within the video.
  • Figure 4: Detailed video duration range statistics for each video category in the SceneWalk dataset.
  • Figure 5: WordCloud analysis of the SceneWalk dataset.
  • ...and 3 more figures