An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes

Ji Qi, Yuan Yao, Yushi Bai, Bin Xu, Juanzi Li, Zhiyuan Liu, Tat-Seng Chua

TL;DR

Quicksviewer introduces a nonuniform perception paradigm for video understanding by partitioning long videos into cubes of varying lengths via a cubing network guided by frame-level momentum, followed by a fixed-token 3D resampling stage. A unified training objective couples Gumbel Softmax-based cubing with an auxiliary loss and a thumbnail gradient path to enable end-to-end learning without boundary labels. The approach achieves a 45× compression rate and a 420-frame receptive field during pretraining, while delivering state-of-the-art results on Video-MME with only 0.8M video-text samples and minimal per-frame tokens, along with competitive performance on other benchmarks. Extensive ablations validate the importance of 3D positional encoding, Gumbel noise annealing, and joint fine-tuning, and qualitative analyses demonstrate effective long-video and multi-image understanding. Overall, the work presents a scalable, efficient framework for video-language models that can handle ultra-long videos with limited supervision and data.

Abstract

Large Multimodal Models (LMMs) uniformly perceive video frames, creating computational inefficiency for videos with inherently varying temporal information density. This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax, followed by a unified resampling for each cube to achieve efficient video understanding. This simple and intuitive approach dynamically compresses video online based on its temporal density, significantly reducing spatiotemporal redundancy (overall 45× compression rate), while enabling efficient training with a large receptive field. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos averaging 420s at 1fps thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by a maximum of 8.72 in accuracy, demonstrating its effectiveness. On Video-MME, Quicksviewer achieves SOTA under modest sequence lengths using at most 5% of the tokens per frame required by baselines. With this paradigm, scaling up the number of input frames reveals a clear power law of the model capabilities. It is also empirically verified that the segments generated by the cubing network can help analyze continuous events in videos.
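
To make the partitioning idea in the abstract more concrete, here is a minimal NumPy sketch of how per-frame Gumbel Softmax decisions could split a frame sequence into cubes of varying length. It is an illustration under stated assumptions, not the paper's cubing network: the function names, the two-column `boundary_logits` layout, and the `noise_scale` and `tau` parameters are all hypothetical, and the logits are supplied directly rather than derived from frame features.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, noise_scale=1.0, rng=None):
    """Relaxed one-hot sample over the last axis (Gumbel Softmax)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    y = (logits + noise_scale * gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def partition_into_cubes(boundary_logits, tau=1.0, noise_scale=1.0, rng=None):
    """Split frame indices into contiguous cubes.

    boundary_logits: (num_frames, 2) array. Assumed layout (hypothetical):
    column 0 scores "continue the current cube", column 1 scores
    "start a new cube at this frame".
    """
    probs = gumbel_softmax(np.asarray(boundary_logits, dtype=float),
                           tau, noise_scale, rng)
    starts = probs.argmax(axis=-1) == 1     # hard decision per frame
    starts[0] = True                        # the first frame always opens a cube
    cubes, current = [], []
    for t, is_start in enumerate(starts):
        if is_start and current:
            cubes.append(current)
            current = []
        current.append(t)
    cubes.append(current)
    return cubes

# Six frames with strong "start" scores at frames 0 and 3; noise disabled
# so the result is deterministic.
logits = [[0, 5], [5, 0], [5, 0], [0, 5], [5, 0], [5, 0]]
cubes = partition_into_cubes(logits, noise_scale=0.0)
```

With the noise switched off, the decision reduces to a per-frame argmax over the logits; with noise enabled, the boundaries become stochastic, which is what lets the boundary choice be explored during training. The sketch covers only the forward partitioning step and omits the gradient path and the per-cube resampling.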

Paper Structure

This paper contains 22 sections, 3 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Quicksviewer involves a cubing network that partitions a video into nonuniform cubes, followed by a 3D resampler to gather a fixed number of visual tokens per cube. This efficiency enables Large Receptive Field (420 frames) with High Compression Rate (64×) during all training stages, and scaling laws on extended frames in inference.
  • Figure 2: Left: The network architecture of Quicksviewer, that performs unified understanding of videos and images through visual tokens from cascaded modules. Right: The cubing network, that partitions an online video into nonuniform cubes based on Gumbel Softmax.
  • Figure 3: (a) Left: Performance of Quicksviewer on particular domains and categories of Video-MME. (b) Right: Distribution of cube lengths across Video-MME videos.
  • Figure 4: The "Visual Lag" phenomenon occurring during the model's cube-based segmental comprehension, where current cubes incorporate terminal frames from preceding event scenes to enable retrospective understanding.
  • Figure 5: (a) Left: Gumbel noise progressively anneals to 0.001 following the decaying learning rate with cosine scheduler. (b) Right: Compared to non-annealed training (cyan curve), adding Gumbel noise annealing (purple curve) yields more stable and superior loss convergence.
  • ...and 1 more figures
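
The Figure 5 caption describes Gumbel noise annealing to 0.001 along a cosine schedule. A minimal sketch of such a schedule follows, assuming the noise scale starts at 1.0 and that a single cosine half-period spans the whole run. Both are assumptions, as are the function and parameter names; only the 0.001 floor and the cosine shape come from the caption.

```python
import math

def gumbel_noise_scale(step, total_steps, start=1.0, floor=0.001):
    """Cosine-decay the noise scale from `start` down to `floor`.

    `start=1.0` is an assumed initial value; `floor=0.001` matches the
    value stated in the Figure 5 caption.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (start - floor) * cosine

# Noise scale at the start, midpoint, and end of a 100-step run.
schedule = [gumbel_noise_scale(s, 100) for s in (0, 50, 100)]
```

Under these assumptions the scale falls smoothly from the start value to the floor, so sampling is noisy early in training and close to deterministic by the end.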