RIVER: A Real-Time Interaction Benchmark for Video LLMs

Yansong Shi, Qingsong Zhao, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang

TL;DR

The Real-tIme Video intERaction Bench (RIVER Bench) is introduced, designed for evaluating online video comprehension, and a general improvement method is proposed that enables models to interact with users more flexibly in real time.

Abstract

The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogues rather than responding to entire videos at once. We conducted detailed annotations using videos from diverse sources and of varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well in single question-answering tasks, they struggle with real-time processing. To address the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we propose a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.

Paper Structure (32 sections, 1 equation, 7 figures, 7 tables, 1 algorithm)

This paper contains 32 sections, 1 equation, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of different online interaction tasks. The timings of the question (Query), the reference event (Cue), and the answer are each marked with a distinct symbol in the figure. Based on the frequency and timing of reference events, questions, and answers, we further categorize online interaction tasks into four distinct subclasses, as depicted in the figure. For Retro-Memory, the clue is drawn from the past; for Live-Perception, it comes from the present; both demand an immediate response. For the Pro-Response task, Video LLMs need to wait until the corresponding clue appears and then respond as quickly as possible.
  • Figure 2: The pie chart on the left illustrates the quantitative distribution of the various tasks within the benchmark. The two bar charts in the middle depict the statistics of video duration and the proportion of question timestamps, respectively. On the right, a word cloud constructed from the annotated textual data in the dataset visually emphasizes the most frequently occurring terms.
  • Figure 3: Illustration of the data processing pipeline.
  • Figure 4: Pipeline to enable MLLMs to support online inference capabilities. The Long Short-Term Memory module continuously receives new visual features and selects the most important parts. After a query is posed at $t_0$, the model is queried at each time window; if it decides to answer, the final response is output.
  • Figure 5: Memory curves of an MLLM [li2024mvbench] and a video agent [fan2024videoagent] under different query time conditions.
  • ...and 2 more figures
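
The online inference loop described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `LongShortTermMemory`, the function `online_answer`, the scalar importance score attached to each frame, and the `respond` callback (a stand-in for the MLLM deciding whether to answer or keep waiting) are all hypothetical names and simplifications introduced here.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional, Tuple

# Hypothetical stand-in for a visual feature: (importance score, frame description).
Feature = Tuple[float, str]


class LongShortTermMemory:
    """Toy memory: recent frames are kept as-is, older frames only if important."""

    def __init__(self, short_size: int = 4, long_size: int = 4) -> None:
        self.short: Deque[Feature] = deque()
        self.long: List[Feature] = []
        self.short_size = short_size
        self.long_size = long_size

    def update(self, feature: Feature) -> None:
        self.short.append(feature)
        if len(self.short) > self.short_size:
            # The oldest short-term frame is handed over to long-term memory.
            self.long.append(self.short.popleft())
            # Keep only the highest-scoring long-term frames.
            self.long = sorted(self.long, key=lambda f: f[0], reverse=True)
            self.long = self.long[: self.long_size]

    def context(self) -> List[str]:
        # Long-term entries (ordered by importance), then short-term entries (by time).
        return [d for _, d in self.long] + [d for _, d in self.short]


def online_answer(
    stream: Iterable[Feature],
    query: str,
    query_time: int,
    window: int,
    respond: Callable[[str, List[str]], Optional[str]],
) -> Optional[Tuple[int, str]]:
    """Feed frames into memory; from query_time on, poll `respond` every `window` steps.

    `respond` returns None to keep waiting, or a string to answer.
    Returns (answer time step, answer), or None if the stream ends first.
    """
    memory = LongShortTermMemory()
    for t, feature in enumerate(stream):
        memory.update(feature)  # memory is updated on every frame, queried or not
        if t < query_time or (t - query_time) % window != 0:
            continue
        answer = respond(query, memory.context())
        if answer is not None:
            return t, answer
    return None
```

In this sketch, a proactive-style query is one where `respond` keeps returning `None` until the relevant cue shows up in the memory context, whereas a retrospective query can be answered at `query_time` itself, provided the cue survived the long-term selection step.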