StreamChat: Chatting with Streaming Video

Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvare

TL;DR

StreamChat tackles the problem of delayed and temporally misaligned responses in streaming video by updating visual context at every decoding step. It introduces a cross-attention–driven architecture with visual feedforward experts and a parallel 3D-RoPE to robustly encode temporal dynamics, together with a dense streaming instruction dataset for training. The approach achieves competitive results on standard image and video benchmarks and demonstrates superior performance in streaming-interaction scenarios, outperforming larger baselines at smaller model scales. Extensive ablations and a dedicated streaming benchmark substantiate the value of dynamic visual-context updating for real-time multimodal reasoning.

Abstract

This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on the visual information available at the moment a question is posed, resulting in significant delays because the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by updating the visual context at each decoding step, ensuring that the model uses up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.
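
The difference between a fixed visual context and one that is updated at each decoding step can be illustrated with a small scheduling sketch. This is not the authors' implementation: the helper names (`visible_frames`, `decode_schedule`), the one-frame-per-second stream, and the assumption that each decoded token takes a fixed `seconds_per_token` of wall-clock time are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def visible_frames(frame_times: List[float], now: float) -> List[int]:
    """Indices of frames that have already arrived at time `now`."""
    return [i for i, t in enumerate(frame_times) if t <= now]


def decode_schedule(
    frame_times: List[float],
    question_time: float,
    num_tokens: int,
    seconds_per_token: float,  # assumption: constant decoding speed
    dynamic: bool = True,
) -> List[Tuple[float, int]]:
    """For each decoding step, return (token_time, newest visible frame index).

    dynamic=False freezes the visual context at the question time;
    dynamic=True refreshes it at every decoding step.
    """
    schedule = []
    for step in range(num_tokens):
        token_time = question_time + step * seconds_per_token
        context_time = token_time if dynamic else question_time
        frames = visible_frames(frame_times, context_time)
        newest = frames[-1] if frames else -1  # -1 means no frame yet
        schedule.append((token_time, newest))
    return schedule


# Hypothetical stream: one frame per second; question asked at second 11.
frame_times = [float(s) for s in range(20)]
fixed = decode_schedule(frame_times, 11.0, 4, 1.0, dynamic=False)
stream = decode_schedule(frame_times, 11.0, 4, 1.0, dynamic=True)
```

With the fixed context, every token sees frame 11 as its newest frame. With per-step updates, the newest visible frame advances along with the text stream.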

Paper Structure

This paper contains 21 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Example of StreamChat on streaming video. In the example, the question is asked at the 11th second. As the model outputs its text stream, it continuously follows the dynamic content of the streaming video and uses up-to-date video content to answer the question.
  • Figure 2: Comparison of context in the decoding process with existing models. For each text token, the black and blue arrows indicate the beginning and end of the utilized visual context, respectively. While existing models (top) use a fixed visual context when decoding, StreamChat (bottom) aligns the video and text streams temporally and dynamically updates its visual context based on the streaming video.
  • Figure 3: The StreamChat architecture. We utilize cross-attention blocks to bridge the visual and text tokens and V-FFN blocks to update the visual tokens throughout the LLM's forward process. Those two blocks' outputs are scaled with a linear gate mechanism.
  • Figure 4: The parallel 3D-RoPE. Visual and text tokens at the same timestamp share the same temporal position.
  • Figure 5: Comparison of StreamChat with leading video LMMs on streaming evaluation. We use StreamChat-7B/-14B as one of the candidate models and report the win/tie/loss rate against VILA or LLaVA-Video models. Our StreamChat models demonstrate stronger streaming interaction capabilities, and can even outperform LLaVA-Video-72B which uses a much larger base LLM.
  • ...and 1 more figure
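
The shared temporal position described in the Figure 4 caption can be sketched as a position-assignment step. This is a minimal illustration, not the paper's method: the `(temporal, height, width)` layout, the helper names, and in particular the choice to give text tokens spatial indices of 0 are assumptions. The caption only states that visual and text tokens at the same timestamp share a temporal position.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def visual_positions(t: int, rows: int, cols: int) -> List[Position]:
    """Positions for the patch tokens of one frame observed at timestamp t."""
    return [(t, r, c) for r in range(rows) for c in range(cols)]


def text_position(t: int) -> Position:
    """Position for a text token emitted at timestamp t.

    Assumption: text tokens have no spatial extent, so both spatial
    indices are fixed at 0. Only the temporal index is shared with
    the visual tokens.
    """
    return (t, 0, 0)


# Hypothetical example: a 2x2 patch grid and a text token, both at t=12.
frame_tokens = visual_positions(t=12, rows=2, cols=2)
token = text_position(t=12)
same_time = all(p[0] == token[0] for p in frame_tokens)
```

Because the temporal index is tied to the timestamp rather than to sequence order, the rotary embedding sees a text token and the frame that arrived at the same moment as temporally adjacent.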