StreamChat: Chatting with Streaming Video
Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvarez
TL;DR
StreamChat tackles the problem of delayed and temporally misaligned responses in streaming video by updating visual context at every decoding step. It introduces a cross-attention–driven architecture with visual feedforward experts and a parallel 3D-RoPE to robustly encode temporal dynamics, together with a dense streaming instruction dataset for training. The approach achieves competitive results on standard image and video benchmarks and demonstrates superior performance in streaming-interaction scenarios, outperforming larger baselines at smaller model scales. Extensive ablations and a dedicated streaming benchmark substantiate the value of dynamic visual-context updating for real-time multimodal reasoning.
Abstract
This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on the visual information available at the moment a question is posed, resulting in significant delays because the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.
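
The idea of refreshing the visual context at every decoding step can be illustrated with a toy simulation. The sketch below is not the StreamChat implementation: the names `visible_frames`, `decode`, `latest_frame_token`, and the `seconds_per_token` parameter are hypothetical, and the "model" is a stand-in function that simply names the newest frame it can see. It only contrasts a context frozen at question time with one that is updated as new frames arrive during decoding.

```python
from typing import Callable, List, Sequence


def visible_frames(frame_times: Sequence[float], now: float) -> List[int]:
    """Return indices of frames whose arrival time is at or before `now`."""
    return [i for i, t in enumerate(frame_times) if t <= now]


def decode(
    frame_times: Sequence[float],
    question_time: float,
    num_tokens: int,
    seconds_per_token: float,
    step_fn: Callable[[List[int], List[str]], str],
    refresh: bool = True,
) -> List[str]:
    """Toy decoding loop (assumption: one token takes `seconds_per_token`).

    With refresh=False the visual context is frozen at question time.
    With refresh=True it is recomputed before every token is produced.
    """
    tokens: List[str] = []
    context = visible_frames(frame_times, question_time)
    for step in range(num_tokens):
        now = question_time + step * seconds_per_token
        if refresh:
            # Pull in any frames that arrived while earlier tokens were decoded.
            context = visible_frames(frame_times, now)
        tokens.append(step_fn(context, tokens))
    return tokens


def latest_frame_token(context: List[int], tokens: List[str]) -> str:
    """Stand-in for a model step: report the newest frame in the context."""
    return f"frame{context[-1]}" if context else "none"


frame_times = [0.0, 1.0, 2.0, 3.0]  # hypothetical arrival times in seconds
static = decode(frame_times, 1.0, 3, 1.0, latest_frame_token, refresh=False)
streaming = decode(frame_times, 1.0, 3, 1.0, latest_frame_token, refresh=True)
```

In this toy setup `static` stays at `["frame1", "frame1", "frame1"]`, while `streaming` advances to `["frame1", "frame2", "frame3"]`, which mirrors the delay that the abstract attributes to methods relying only on the visual information available when the question is asked.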
