Table of Contents
Fetching ...

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

Yunsheng Ma, Amr Abdelraouf, Rohit Gupta, Ziran Wang, Kyungtae Han

TL;DR

Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information, is proposed.

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems through powerful logical reasoning capabilities. However, the deployment of these models faces significant challenges due to their substantial parameter sizes and computational demands, which often exceed the constraints of onboard computation. One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information, leading to increased latency and memory consumption. To address this issue, we propose Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information. VTS employs a lightweight CNN-based proposal model to adaptively identify key frames and prune less informative tokens, effectively mitigating hallucinations and increasing inference throughput without compromising performance. We conduct comprehensive experiments on the DRAMA and LingoQA benchmarks, demonstrating the effectiveness of VTS in achieving up to a 33\% improvement in inference throughput and a 28\% reduction in memory usage compared to the baseline without compromising performance.

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

TL;DR

Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information, is proposed.

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems through powerful logical reasoning capabilities. However, the deployment of these models faces significant challenges due to their substantial parameter sizes and computational demands, which often exceed the constraints of onboard computation. One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information, leading to increased latency and memory consumption. To address this issue, we propose Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information. VTS employs a lightweight CNN-based proposal model to adaptively identify key frames and prune less informative tokens, effectively mitigating hallucinations and increasing inference throughput without compromising performance. We conduct comprehensive experiments on the DRAMA and LingoQA benchmarks, demonstrating the effectiveness of VTS in achieving up to a 33\% improvement in inference throughput and a 28\% reduction in memory usage compared to the baseline without compromising performance.
Paper Structure (23 sections, 5 equations, 3 figures, 4 tables)

This paper contains 23 sections, 5 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of Video Token Sparsification (VTS): The input consists of both the current frame and previous frames (referred to as Frame A and Frame B in this example). A CNN-based proposal model generates feature maps for each frame. VTS identifies a key frame (Frame A) based on these feature maps and computes a pruning score for each token in the non-key frames (Frame B), considering both saliency and dissimilarity to the corresponding tokens in the key frame. The top $s\%$ of tokens from the non-key frames are selected and concatenated with the tokens from the key frame to form the final sparsified token sequence, which is then input to the LLM for efficient reasoning.
  • Figure 2: Impact of token sparsification rate $s$ on inference throughput (request per second), memory consumption, and Lingo-Judge scores.
  • Figure 3: Visualization of VTS compared with other token-reduction-based approaches.