Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

Yunsheng Ma; Amr Abdelraouf; Rohit Gupta; Ziran Wang; Kyungtae Han

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

Yunsheng Ma, Amr Abdelraouf, Rohit Gupta, Ziran Wang, Kyungtae Han

TL;DR

Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information, is proposed.

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential for enhancing scene understanding in autonomous driving systems through powerful logical reasoning capabilities. However, the deployment of these models faces significant challenges due to their substantial parameter sizes and computational demands, which often exceed the constraints of onboard computation. One major limitation arises from the large number of visual tokens required to capture fine-grained and long-context visual information, leading to increased latency and memory consumption. To address this issue, we propose Video Token Sparsification (VTS), a novel approach that leverages the inherent redundancy in consecutive video frames to significantly reduce the total number of visual tokens while preserving the most salient information. VTS employs a lightweight CNN-based proposal model to adaptively identify key frames and prune less informative tokens, effectively mitigating hallucinations and increasing inference throughput without compromising performance. We conduct comprehensive experiments on the DRAMA and LingoQA benchmarks, demonstrating the effectiveness of VTS in achieving up to a 33\% improvement in inference throughput and a 28\% reduction in memory usage compared to the baseline without compromising performance.

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

TL;DR

Abstract

Paper Structure (23 sections, 5 equations, 3 figures, 4 tables)

This paper contains 23 sections, 5 equations, 3 figures, 4 tables.

Introduction
Related Works
Vision-Language Models for Driving
Token Reduction
Methodology
Temporal Reasoning
Video Token Sparsification (VTS)
Proposal Model
Key Frame Selection
Token Pruning
Training and Inference
Experiments and Results
Datasets
Evaluation Metrics
Implementation Details
...and 8 more sections

Figures (3)

Figure 1: Overview of Video Token Sparsification (VTS): The input consists of both the current frame and previous frames (referred to as Frame A and Frame B in this example). A CNN-based proposal model generates feature maps for each frame. VTS identifies a key frame (Frame A) based on these feature maps and computes a pruning score for each token in the non-key frames (Frame B), considering both saliency and dissimilarity to the corresponding tokens in the key frame. The top $s\%$ of tokens from the non-key frames are selected and concatenated with the tokens from the key frame to form the final sparsified token sequence, which is then input to the LLM for efficient reasoning.
Figure 2: Impact of token sparsification rate $s$ on inference throughput (request per second), memory consumption, and Lingo-Judge scores.
Figure 3: Visualization of VTS compared with other token-reduction-based approaches.

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

TL;DR

Abstract

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

Authors

TL;DR

Abstract

Table of Contents

Figures (3)