SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Mingze Xu, Mingfei Gao, Shiyu Li, Jiasen Lu, Zhe Gan, Zhengfeng Lai, Meng Cao, Kai Kang, Yinfei Yang, Afshin Dehghan

TL;DR

SF-LLaVA-1.5 addresses the challenge of long-form video understanding with token-efficient, edge-friendly Video LLMs. It integrates a two-stream SlowFast visual front-end with a two-stage training pipeline on publicly available data to build 1B–7B parameter models. The approach delivers state-of-the-art performance on long-form video benchmarks (LongVideoBench, LVBench, MLVU) and strong results on short-form video and image tasks, while using fewer input tokens than comparable methods. The work emphasizes reproducibility and accessibility by avoiding private datasets and providing a compact training regime suitable for scalable deployment.

Abstract

We introduce SlowFast-LLaVA-1.5 (abbreviated as SF-LLaVA-1.5), a family of video large language models (LLMs) offering a token-efficient solution for long-form video understanding. We incorporate the two-stream SlowFast mechanism into a streamlined training pipeline, and perform joint video-image training on a carefully curated data mixture of only publicly available datasets. Our primary focus is on highly efficient model scales (1B and 3B), demonstrating that even relatively small Video LLMs can achieve state-of-the-art performance on video understanding, meeting the demand for mobile-friendly models. Experimental results demonstrate that SF-LLaVA-1.5 achieves superior performance on a wide range of video and image tasks, with robust results at all model sizes (ranging from 1B to 7B). Notably, SF-LLaVA-1.5 achieves state-of-the-art results in long-form video understanding (e.g., LongVideoBench and MLVU) and excels at small scales across various video benchmarks.

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Visualization of the video understanding pipeline in SlowFast-LLaVA-1.5. Unlike its training-free predecessor, SlowFast-LLaVA (Xu et al., 2024), SF-LLaVA-1.5 fine-tunes the projector and LLM throughout training while keeping the vision encoder frozen.
  • Figure 2: Visualization of Group-based SlowFast (GSF) and Interleaved SlowFast (ISF). In this example, each sliding window contains three frames, with the first frame serving as Slow (yellow) and the others as Fast (cyan). The number on each token indicates the timestamp of the frame it corresponds to. (Best viewed in color.)
  • Figure 3: SF-LLaVA-1.5 summarizes a video with a detailed caption.
  • Figure 4: SF-LLaVA-1.5 learns the process from the video and captures text-rich details.
  • Figure 5: SF-LLaVA-1.5 understands the relative sequence of different activities.
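
The Figure 2 caption describes two ways of arranging Slow and Fast frames within sliding windows. The sketch below illustrates one plausible reading of that caption and is not the authors' implementation. It assumes that "group-based" places all Slow frames before all Fast frames, and that "interleaved" keeps each window's Slow frame immediately before that window's Fast frames. The function names, the 0-based timestamps, and the one-entry-per-frame simplification (real models emit many tokens per frame) are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# One entry per frame: (pathway, frame timestamp). This is a simplification;
# an actual model would emit many visual tokens per frame.
FrameTag = Tuple[str, int]


def split_windows(num_frames: int, window: int) -> List[List[int]]:
    """Split frame indices 0..num_frames-1 into consecutive sliding windows."""
    return [
        list(range(start, min(start + window, num_frames)))
        for start in range(0, num_frames, window)
    ]


def group_based_order(num_frames: int, window: int = 3) -> List[FrameTag]:
    """Assumed GSF ordering: every Slow frame first, then every Fast frame."""
    windows = split_windows(num_frames, window)
    slow = [("slow", w[0]) for w in windows]  # first frame of each window
    fast = [("fast", t) for w in windows for t in w[1:]]  # remaining frames
    return slow + fast


def interleaved_order(num_frames: int, window: int = 3) -> List[FrameTag]:
    """Assumed ISF ordering: Slow then Fast frames, window by window."""
    ordered: List[FrameTag] = []
    for w in split_windows(num_frames, window):
        ordered.append(("slow", w[0]))
        ordered.extend(("fast", t) for t in w[1:])
    return ordered


if __name__ == "__main__":
    print("GSF:", group_based_order(6))
    print("ISF:", interleaved_order(6))
```

With six frames and a window of three, the group-based ordering under these assumptions puts the Slow frames at timestamps 0 and 3 first, followed by the Fast frames 1, 2, 4, 5. The interleaved ordering keeps the frames in temporal order and only changes which pathway each frame belongs to.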