SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, Changsheng Xu
TL;DR
SVBench addresses the lack of suitable evaluation for Large Vision-Language Models (LVLMs) on streaming video by introducing a large-scale benchmark built around temporal multi-turn dialogues. It constructs QA chains aligned with video segments and links successive chains through explicit temporal linkages to assess long-context reasoning, annotating 1,353 videos with 49,979 QA pairs. The authors also introduce StreamingChat, a streaming LVLM built on InternViT and InternLM2 with LoRA and a 32k context window, which outperforms other open-source LVLMs on SVBench and is competitive with closed-source models. Across its two evaluation setups (dialogue and streaming), SVBench shows that current LVLMs struggle with long-context streaming understanding. The benchmark is accompanied by a real-time leaderboard and open-access resources. Overall, SVBench demonstrates that rigorous streaming video evaluation is both necessary and feasible, and it offers a concrete path for advancing streaming multimodal models.
Abstract
Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in evaluating their applicability to the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the streaming video understanding capabilities of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs for 1,353 streaming videos. The pipeline generates QA chains, each representing a series of consecutive multi-turn dialogues over video segments, and constructs temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms the others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct the StreamingChat model, which significantly outperforms open-source LVLMs on SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance research on streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed at https://github.com/sotayang/SVBench.
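To make the annotation structure described above more concrete, the following is a minimal Python sketch of how QA chains over video segments and temporal linkages between successive chains might be represented. It is an illustration only: the class names (`QAPair`, `QAChain`, `TemporalLink`, `StreamingVideoAnnotation`), their fields, and the `add_link` rule are hypothetical and do not reflect the actual SVBench data format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class QAPair:
    """One dialogue turn (hypothetical schema, not the SVBench format)."""
    question: str
    answer: str


@dataclass
class QAChain:
    """A multi-turn dialogue tied to one video segment (times in seconds)."""
    segment_start: float
    segment_end: float
    turns: list[QAPair] = field(default_factory=list)


@dataclass
class TemporalLink:
    """Connects a turn in one chain to a turn in the next chain.

    Each endpoint is a (chain index, turn index) pair.
    """
    source: tuple[int, int]
    target: tuple[int, int]


@dataclass
class StreamingVideoAnnotation:
    video_id: str
    chains: list[QAChain] = field(default_factory=list)
    links: list[TemporalLink] = field(default_factory=list)

    def add_link(self, source: tuple[int, int], target: tuple[int, int]) -> None:
        """Add a linkage, assuming links only join successive chains."""
        if target[0] != source[0] + 1:
            raise ValueError("links must connect a chain to its successor")
        for chain_idx, turn_idx in (source, target):
            # Reject endpoints that do not refer to an existing turn.
            if not 0 <= chain_idx < len(self.chains):
                raise ValueError("chain index out of range")
            if not 0 <= turn_idx < len(self.chains[chain_idx].turns):
                raise ValueError("turn index out of range")
        self.links.append(TemporalLink(source, target))

    def num_qa_pairs(self) -> int:
        """Total number of QA pairs across all chains of this video."""
        return sum(len(chain.turns) for chain in self.chains)


def average_qa_pairs_per_video(total_pairs: int, total_videos: int) -> float:
    """Dataset-level average, e.g. for the counts quoted in the abstract."""
    return total_pairs / total_videos
```

With the counts reported in the abstract, `average_qa_pairs_per_video(49979, 1353)` comes to roughly 36.9 QA pairs per video.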
