LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Joya Chen, Ziyun Zeng, Yiqi Lin, Wei Li, Zejun Ma, Mike Zheng Shou

TL;DR

This work presents LiveCC, a scalable framework for training Video LLMs by densely interleaving temporally aligned ASR transcripts with video frames, enabling low-latency real-time commentary. It introduces two datasets, Live-CC-5M for pretraining and Live-WhisperX-526K for instruction tuning, and a streaming training objective built atop Qwen2-VL-7B-Base. The authors also introduce LiveSports-3K, a dual-track benchmark for streaming commentary and video QA, and demonstrate that LiveCC-7B-Instruct achieves state-of-the-art performance on several video QA benchmarks and competitive, or superior, streaming commentary against much larger models. The approach demonstrates end-to-end scalability and practical real-time capabilities for video understanding, with publicly released resources to support further research and application development.

Abstract

Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show that our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even when working in real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.
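To make the idea of interleaving ASR words and video frames "according to their timestamps" concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the `Word` type, the `interleave` function, the 2 FPS sampling rate, and the rule of assigning a word to the frame interval containing its start time are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One ASR word with its timestamps in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def interleave(words, duration, fps=2.0):
    """Return one (frame_time, [word texts]) pair per sampled frame.

    Frame k covers the interval [k / fps, (k + 1) / fps). A word goes to the
    interval that contains its start time (an assumed convention).
    """
    n_frames = int(duration * fps)
    buckets = [[] for _ in range(n_frames)]
    for w in words:
        k = int(w.start * fps)
        if 0 <= k < n_frames:  # drop words that start outside the clip
            buckets[k].append(w.text)
    return [(k / fps, buckets[k]) for k in range(n_frames)]


words = [
    Word("and", 0.10, 0.25),
    Word("he", 0.30, 0.40),
    Word("shoots", 0.55, 0.95),
    Word("scores", 1.60, 2.00),
]
stream = interleave(words, duration=2.0, fps=2.0)
for frame_time, texts in stream:
    print(f"frame@{frame_time:.1f}s -> {texts}")
```

Note that some frame intervals may receive no words at all (silence in the speech track); in this sketch they simply carry an empty list.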

Paper Structure

This paper contains 21 sections, 1 equation, 13 figures, 5 tables.

Figures (13)

  • Figure 1: LiveCC provides real-time commentary for streaming video, emulating a human commentator. This example is drawn from the YouTube video (ID: https://www.youtube.com/watch?v=I7pTpMjqNRM), featuring the Paris 2024 Olympics Men's Basketball Final between France and the USA. Our 7B model generates continuous commentary with a latency of less than 0.5 seconds per frame, supporting real-time applications at 2 FPS.
  • Figure 2: LiveCC data production pipeline. We begin by integrating several large-scale YouTube video datasets (HD-VILA, YT-Temporal, VidChapters, HowTo100M, LLaVA-Video-178K), followed by metadata filtering, resulting in a curated pool of 5.7M videos. Then, the pre-training dataset is built using the original YouTube CC, while the SFT dataset leverages higher-quality ASR transcriptions generated by WhisperX. We also introduce a set of efficient filtering techniques to improve the SFT data quality. Please refer to the paper's data section for details.
  • Figure 3: Overview of our proposed Live-CC-5M dataset.
  • Figure 4: Overview of the Live-WhisperX-526K dataset.
  • Figure 5: Modeling overview of LiveCC. The model processes streaming video frames through a visual encoder to produce visual tokens, and assigns the ASR text from each corresponding frame interval as text tokens. The LLM autoregressively predicts the text tokens within this densely interleaved token sequence. To mitigate learning ambiguity, additional context (the preceding ASR text or the video title) is provided during pre-training. During SFT, the context consists only of the user query, matching real-world usage.
  • ...and 8 more figures
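
The sequence layout described in the Figure 5 caption can be sketched as follows. This is an illustrative toy, not the released implementation: `build_sequence` and the `<frame>` placeholder are hypothetical names, a single placeholder stands in for the many visual tokens a real encoder would emit per frame, whitespace splitting stands in for a real tokenizer, and marking only the ASR words as prediction targets is an assumption based on the caption's statement that the LLM predicts the text tokens.

```python
def build_sequence(context, per_frame_words, frame_token="<frame>"):
    """Flatten context plus interleaved frames and words into one sequence.

    Returns (tokens, is_target), where is_target[i] is True only for ASR
    words, i.e. the positions the model would be trained to predict
    (an assumption for this sketch).
    """
    tokens, is_target = [], []
    # Context prefix: video title or preceding ASR text in pre-training,
    # the user query in SFT. It is conditioned on, not predicted.
    for tok in context.split():
        tokens.append(tok)
        is_target.append(False)
    # Each frame contributes its visual placeholder, then its ASR words.
    for words in per_frame_words:
        tokens.append(frame_token)
        is_target.append(False)
        for w in words:
            tokens.append(w)
            is_target.append(True)
    return tokens, is_target


tokens, is_target = build_sequence(
    "Olympic basketball final",
    [["and", "he"], ["shoots"], []],
)
for tok, flag in zip(tokens, is_target):
    print(f"{tok:12s} {'predict' if flag else 'context'}")
```

At inference time the same layout would be consumed incrementally: after each new frame is appended, the model generates the words for that interval before the next frame arrives.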