VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

TL;DR

This work addresses the prohibitive compute of online video large language models by introducing VideoLLM-MoD, a Mixture-of-Depths approach that sparsifies vision-token processing across transformer layers with a learnable LayerExpert. By skipping computation for a large subset of vision tokens within chosen layers and routing only the most informative tokens to self-attention/FFN, the method achieves substantial training-time and memory savings (approximately $42\%$ time and $30\%$ memory) while preserving or improving performance thanks to maintained contextual information. The approach demonstrates state-of-the-art results across narration, forecasting, and summarization tasks on Ego4D, EgoExo4D, and COIN datasets, and generalizes to offline video settings as well. The practical impact is a more efficient, scalable online video understanding system capable of temporally aligned responses with reduced resource requirements.

Abstract

A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., the key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach that reduces vision compute by letting redundant vision tokens skip layers rather than by decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video. Specifically, for each transformer layer, we learn to skip the computation for a high proportion (e.g., 80%) of vision tokens, passing them directly to the next layer. This approach significantly enhances model efficiency, achieving approximately 42% time and 30% memory savings over the entire training process. Moreover, our method reduces computation within the context while avoiding any reduction in the number of vision tokens, thus preserving or even improving performance compared to the vanilla model. We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks on the COIN, Ego4D, and Ego-Exo4D datasets.
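
The layer-skipping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear router standing in for LayerExpert, the function name `route_frame_tokens`, the `keep_ratio` parameter, and the toy layer are all hypothetical. Scaling the update by the router score follows the general mixture-of-depths formulation and is an assumption here. Selecting the top-k tokens within one frame follows the Figure 3 caption.

```python
def route_frame_tokens(tokens, router_weights, keep_ratio, layer_fn):
    """Apply layer_fn to the top-k scored tokens of one frame; skip the rest.

    tokens: list of token vectors (lists of floats) for a single frame.
    router_weights: hypothetical linear router, one weight per feature.
    keep_ratio: fraction of tokens that receive full computation.
    layer_fn: maps a list of token vectors to a list of same-shaped updates.
    Returns (new_tokens, kept_indices).
    """
    # Score every token with the (assumed) linear router.
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]

    # Keep the k highest-scoring tokens, then restore their original order.
    k = max(1, round(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])

    # Only the kept tokens go through the expensive layer computation.
    updates = layer_fn([tokens[i] for i in kept])

    # Skipped tokens are copied through unchanged (they "skip the layer").
    new_tokens = [list(tok) for tok in tokens]
    for i, upd in zip(kept, updates):
        # Residual update scaled by the router score (assumption, see lead-in).
        new_tokens[i] = [x + scores[i] * u for x, u in zip(tokens[i], upd)]
    return new_tokens, kept


# Toy usage: 5 tokens, keep 20% (i.e., skip 80%), layer adds 1.0 per feature.
frame = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.5], [0.5, 0.5]]
toy_layer = lambda toks: [[1.0 for _ in t] for t in toks]
out, kept = route_frame_tokens(frame, [1.0, 0.0], keep_ratio=0.2, layer_fn=toy_layer)
```

Because skipped tokens are passed through rather than dropped, every vision token remains in the sequence for later layers, which is the property the abstract contrasts with token-reduction approaches such as Q-Former and Perceiver Resampler.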

Paper Structure

This paper contains 19 sections, 4 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Cases of VideoLLM-MoD on an Ego4D GoalStep video. Using only the CLS token often results in spatial understanding errors, e.g., mistaking 'broccoli' for 'bell pepper.' VideoLLM-MoD improves fine-grained spatial ability by integrating more spatial tokens while reducing computation costs compared to the improved baseline. Text in red indicates incorrect responses.
  • Figure 2: Training Computation Cost. VideoLLM-MoD exhibits greater efficiency compared to the baseline.
  • Figure 3: VideoLLM-MoD selects the top-$k$ vision tokens within each frame in certain layers via LayerExpert. We observe that performance drops dramatically with Early-exit, as critical vision tokens miss subsequent processing. By retaining crucial vision tokens in certain layers and reducing redundant tokens that may mislead understanding, VideoLLM-MoD achieves better performance with significantly lower computation costs compared to the Full-computation baseline.
  • Figure 4: Efficiency analysis of VideoLLM-MoD in both the training and inference phases.
  • Figure 5: Examples of VideoLLM-MoD on the Ego4D GoalStep video dataset. We found that VideoLLM-MoD effectively reduces hallucinations and performs more robustly than the model trained with full computation. For instance, our model correctly recognizes "pick up the box" while the baseline mistakenly identifies it as "pick up tire." Text in red indicates incorrect responses.
  • ...and 2 more figures