VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

TL;DR

This work tackles the challenge of streaming video understanding by proposing the LIVE framework, enabling temporally aligned, long-context, real-time dialogue over continuous video. It introduces a streaming EOS objective, data-generation strategies that convert offline annotations into streaming dialogues, and inference optimizations such as continuous key-value caching to meet real-time demands. VideoLLM-online, built on CLIP-based vision encoders and Llama-2/3 with LoRA fine-tuning, demonstrates real-time performance (>10 FPS on an A100) and strong offline results on COIN and Ego4D benchmarks. The approach achieves state-of-the-art performance among end-to-end models on offline tasks and provides a practical path towards always-on AI assistants in streaming video settings, while outlining avenues for data, spatial capability, and scalability improvements.

Abstract

Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we build the VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.
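
As a rough illustration of what "temporally aligned" streaming dialogue could look like at inference time, the following minimal Python sketch consumes frames one by one and replies only when a stand-in model does not predict EOS. It is not the authors' implementation: `stream_dialogue`, `toy_next_token`, the `frame:` token format, and the reading that "EOS at a frame means stay silent" are all assumptions made for this example, and a plain Python list stands in for the continuous key-value cache.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"  # hypothetical marker: "nothing to say at this point"


def toy_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for the language model: speak when the action changes."""
    last = context[-1]
    if last.startswith("frame:"):
        earlier = [c for c in context[:-1] if c.startswith("frame:")]
        if earlier and earlier[-1] == last:
            return EOS                      # same action as before: stay silent
        return last.split(":", 1)[1]        # new action: reply with its name
    return EOS                              # a reply is one word long in this toy


def stream_dialogue(
    frames: List[str],
    next_token: Callable[[List[str]], str],
    max_reply_tokens: int = 16,
) -> List[Tuple[int, str]]:
    """Process frames in order and return (frame_index, reply) pairs."""
    cache: List[str] = []                   # stands in for a continuous KV cache
    replies: List[Tuple[int, str]] = []
    for t, frame in enumerate(frames):
        cache.append(frame)                 # append only; the past is never re-encoded
        token = next_token(cache)
        words: List[str] = []
        while token != EOS and len(words) < max_reply_tokens:
            words.append(token)
            cache.append(token)             # generated tokens join the same context
            token = next_token(cache)
        if words:
            replies.append((t, " ".join(words)))
    return replies


frames = ["frame:chop", "frame:chop", "frame:stir", "frame:stir"]
print(stream_dialogue(frames, toy_next_token))
```

In this toy run the model speaks only at the frames where the action changes and stays silent otherwise, which is the behavior the streaming objective is meant to encourage.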

Paper Structure

This paper contains 20 sections, 5 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Zero-shot examples of our VideoLLM-online applied to an egocentric video stream from the Ego-Exo4D dataset. Our model is designed for temporally aligned, long-context, real-time dialogue in continuous video streams, shedding light on future always-on, contextual AI assistants (e.g., smart AR glasses). Model responses are appropriately simplified for better visualization.
  • Figure 2: Our model shows strong temporal alignment capability in streaming video narration. The query at the beginning is "Please describe what I am doing in real time".
  • Figure 3: The streaming dialogue data generation method in our LIVE framework. We randomly insert templated questions into the video timeline and "expose" the ground-truth video annotations (along with their timestamps) to LLMs, prompting them to answer the queries within a period of time.
  • Figure 4: The training method in our LIVE framework. We organize the user-assistant dialogue data and video frames in temporal order as the input sequence. To teach the model when to answer and when to keep silent in a video stream, we employ not only the standard language modeling (LM) loss but also a streaming EOS prediction loss. This additional loss supervises when the model needs to generate language, enabling it to produce temporally aligned responses and reducing redundant dialogue history.
  • Figure 5: Inference pipeline in our LIVE framework. During inference, video frames serve as streaming inputs. Our model maintains a continuous key-value cache as the input progresses to speed up inference. Furthermore, we parallelize the fast video frame encoder and the slower language model to avoid a bottleneck at the latter. Video frame tokens can always be encoded and buffered, with no need to wait for language decoding.
  • ...and 3 more figures
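
The Figure 4 caption describes a training objective that combines the standard LM loss with a streaming EOS prediction loss. A minimal sketch of how such a combined loss might be computed is given below, in plain Python. The function name `streaming_loss`, the two mask arguments, the `eos_weight` balance term, and the averaging over all positions are assumptions made for illustration; the exact formulation is given by the paper's equations, which are not reproduced on this page.

```python
import math
from typing import List, Sequence


def streaming_loss(
    probs: Sequence[Sequence[float]],  # probs[j][k]: predicted probability of token k at position j
    targets: Sequence[int],            # targets[j]: ground-truth next token at position j
    lm_mask: Sequence[int],            # 1 where position j is supervised by the LM loss
    eos_mask: Sequence[int],           # 1 where position j is a frame at which the model should stay silent
    eos_id: int,                       # index of the EOS token in the vocabulary
    eos_weight: float = 1.0,           # assumed balance term between the two losses
) -> float:
    """Sum the LM and streaming-EOS cross-entropy terms, averaged over positions."""
    total = 0.0
    for p, y, use_lm, use_eos in zip(probs, targets, lm_mask, eos_mask):
        if use_lm:
            total += -math.log(p[y])                      # standard next-token loss
        if use_eos:
            total += -eos_weight * math.log(p[eos_id])    # push EOS at silent frames
    return total / len(probs)


# Toy vocabulary of two tokens: id 0 is EOS, id 1 is a word.
probs: List[List[float]] = [[0.5, 0.5], [0.25, 0.75]]
loss = streaming_loss(
    probs, targets=[1, 0], lm_mask=[1, 0], eos_mask=[0, 1], eos_id=0
)
print(round(loss, 4))
```

Position 0 contributes only the LM term and position 1 only the EOS term, so a model that assigns low probability to EOS at a frame where it should stay silent is penalized even though no text is being generated there.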