ViSpeak: Visual Instruction Feedback in Streaming Videos

Shenghao Fu; Qize Yang; Yuan-Ming Li; Yi-Xing Peng; Kun-Yu Lin; Xihan Wei; Jian-Fang Hu; Xiaohua Xie; Wei-Shi Zheng

TL;DR

ViSpeak addresses the challenge of streaming video understanding by introducing Visual Instruction Feedback, a setting where agents must actively respond to visual content in real time. It presents a three-stage finetuning pipeline that adapts an offline omni-modal model to streaming with a two-stream input template and an informative head for proactive outputs, achieving GPT-4o-level performance on key benchmarks. The work also provides ViSpeak-Bench and ViSpeak-Instruct datasets to evaluate and train visual-instruction feedback capabilities, demonstrating SOTA results on StreamingBench and OVO-Bench among open-source models. Together, these contributions advance real-time human-agent interaction in video streams and offer practical datasets and baselines for future research.

Abstract

Recent advances in Large Multi-modal Models (LMMs) have focused primarily on offline video understanding. Streaming video understanding, by contrast, poses great challenges to recent models because of its time-sensitive, omni-modal, and interactive characteristics. In this work, we extend streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback, in which models should be aware of visual content and learn to extract instructions from it. For example, when users wave their hands at an agent, the agent should recognize the gesture and start a conversation with a greeting. Following instructions in the visual modality thus greatly enhances user-agent interaction. To facilitate research, we define seven key subtasks highly relevant to the visual modality and collect the ViSpeak-Instruct dataset for training and ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research.
