LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant

Wei Li, Bing Hu, Rui Shao, Leyang Shen, Liqiang Nie

TL;DR

LION-FS tackles the dual challenge of real-time responsiveness and precise, context-aware online video dialogue in first-person videos. It introduces a fast–slow two-path architecture. The Fast Path uses a Token Aggregation Router to fuse high-frame-rate egocentric features with low-frame-rate general features, and a Token Dropping Router to reduce computation. The Slow Path augments keyframes with Grid Tokens and Box Tokens via a Multimodal Thinking Template for richer, more accurate responses. The approach uses dual encoders (EgoVLPv2 and SigLIP) for complementary visual information and demonstrates state-of-the-art efficacy and efficiency on Ego4D and Ego-Exo4D streaming benchmarks, including higher fluency and correctness with fourfold higher input frame rates. The result is an online video assistant capable of proactive, temporally precise guidance in real-time settings, with computational efficiency suited to head-mounted or lightweight devices.

Abstract

First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features. To overcome the trade-off between efficacy and efficiency, we propose "Fast & Slow Video-Language Thinker" as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1) Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features. 2) Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
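
The abstract says Token Dropping Routing eliminates redundant features but does not state the selection criterion. The following is a minimal sketch under assumptions, not the authors' implementation: it supposes that a router assigns each token a scalar score and that a fixed fraction of the highest-scoring tokens is kept. The function name `drop_redundant_tokens`, the `keep_ratio` parameter, and the example scores are all hypothetical.

```python
import math


def drop_redundant_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens and preserve their original order.

    Assumption: `scores` come from some router; how LION-FS computes them
    is not specified in the abstract.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token indices by score, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore the original positional order among the kept tokens.
    kept_indices = sorted(ranked[:num_keep])
    return [tokens[i] for i in kept_indices]


# Hypothetical tokens for one frame, with made-up router scores.
frame_tokens = ["cls", "hand", "wall", "bicycle", "sky", "wrench"]
router_scores = [0.9, 0.8, 0.1, 0.7, 0.05, 0.6]
kept = drop_redundant_tokens(frame_tokens, router_scores, keep_ratio=0.5)
# kept == ["cls", "hand", "bicycle"]
```

Under this assumption the language model sees half as many visual tokens per frame, which is one way the cost of a higher input frame rate could be offset.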

Paper Structure

This paper contains 25 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Comparison between LIVE (VideoLLM-online) and LION-FS. LIVE processes low-frame-rate videos using coarse-grained image tokens, resulting in suboptimal response accuracy. LION-FS, by efficiently handling high-frame-rate videos through Fast-Path dynamic spatiotemporal fusion and Slow-Path multi-granular keyframe augmentation, significantly enhances response determination accuracy and content precision.
  • Figure 2: The whole framework of LION-FS. Fast Path enables high-frame-rate video stream reception, allowing real-time determination of whether a response is required. $E_{gen}$ (SigLIP) extracts general spatial features from 2 FPS frames, while $E_{ego}$ (EgoVLPv2) captures first-person temporal features from 8 FPS frames. These are temporally aligned, weighted through the Token Aggregation Router, and then filtered for redundancy by the Token Dropping Router. Slow Path enhances keyframes with rich information, performing multi-granularity augmentation that includes fine-grained global tokens (Grid Tokens) and action-related local tokens (Box Tokens), which are injected into the Multimodal Thinking Template to guide the assistant in generating more precise responses.
  • Figure 3: Different Token Aggregation Strategies: (a) Concatenate tokens along the sequence dimension. (b) Aggregate tokens based on adaptive weights generated by the router to perform customized routing. It can achieve visual information aggregation without increasing token numbers.
  • Figure 4: Boxplot visualization of token aggregation routing outcomes. We select the weights of $E_{gen}$ for analysis. Token 1 is the CLS token, highlighted in yellow.
  • Figure 5: Quantitative analysis of LIVE (VideoLLM-online) and LION-FS on the Ego4D dataset. Questions posed by the user are marked, such as the request "Please narrate the video in real-time" at 0.0s and "How did I repair the bicycle?" at 49.0s. Purple highlights indicate imprecise responses, while red highlights denote incorrect responses. LION-FS achieves progressive improvements in response precision through the integration of Fast Path and Slow Path mechanisms.
  • ...and 1 more figure
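
The captions of Figures 2 and 3 describe two steps that a small example can make concrete: temporally aligning 8 FPS egocentric frames with 2 FPS general frames (four egocentric frames per general frame), and the difference between concatenating tokens and aggregating them with router weights. The sketch below illustrates those ideas under assumptions and is not the authors' code. It assumes the router emits two logits per token position (one per encoder) that are normalised with a softmax, and that both encoders have already been reduced to the same number of tokens. All function names and the toy values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_ego_to_gen(ego_frames, ego_fps=8, gen_fps=2):
    """Group egocentric frames so each group covers one general frame in time."""
    if ego_fps % gen_fps != 0:
        raise ValueError("ego_fps must be a multiple of gen_fps")
    step = ego_fps // gen_fps  # 8 FPS vs 2 FPS gives groups of 4
    return [ego_frames[i:i + step] for i in range(0, len(ego_frames), step)]


def concat_tokens(gen_tokens, ego_tokens):
    """Figure 3(a): concatenate along the sequence dimension (token count doubles)."""
    return gen_tokens + ego_tokens


def route_tokens(gen_tokens, ego_tokens, router_logits):
    """Figure 3(b): weighted sum per token position (token count unchanged).

    Assumption: one (gen, ego) logit pair per token position.
    """
    fused = []
    for gen_vec, ego_vec, logits in zip(gen_tokens, ego_tokens, router_logits):
        w_gen, w_ego = softmax(logits)
        fused.append([w_gen * g + w_ego * e for g, e in zip(gen_vec, ego_vec)])
    return fused


# One second of video: 8 egocentric frame indices map onto 2 general frames.
groups = align_ego_to_gen(list(range(8)))

# Toy 2-dimensional tokens, two token positions per encoder.
gen_tokens = [[1.0, 0.0], [0.0, 1.0]]
ego_tokens = [[0.0, 1.0], [1.0, 0.0]]
router_logits = [[0.0, 0.0], [math.log(3.0), 0.0]]  # second token favours E_gen 3:1

concatenated = concat_tokens(gen_tokens, ego_tokens)        # 4 tokens
fused = route_tokens(gen_tokens, ego_tokens, router_logits)  # 2 tokens
```

In this toy case concatenation yields four tokens while routing yields two, which is the property the Figure 3 caption highlights: visual information from both encoders is combined without increasing the token count.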