SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving

Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Ho Gun Park, Il Yong Chun

TL;DR

SToRM tackles the high computational cost of multi-modal LLMs in end-to-end autonomous driving by learning to reduce visual tokens under supervision. It introduces a lightweight, short-term spatio-temporal importance predictor and an anchor-context merging module, guided by pseudo-supervision signals derived from an all-token LLM pass. The framework enables end-to-end training and achieves driving performance on par with all-token baselines while reducing FLOPs by up to about 30x and cutting memory usage. On the LangAuto benchmark, SToRM outperforms other token-reduction approaches under the same reduced-token budget, demonstrating practical potential for real-time multimodal E2E driving.

Abstract

In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have advanced significantly. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, this approach requires substantial computational resources, which are limited in autonomous vehicles, because it relies on an LLM and numerous visual tokens from sensor inputs. Many MLLM studies have explored reducing visual tokens, but these methods often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
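
To make the second element more concrete, the following is a minimal sketch of how pseudo-supervision targets might be derived from the attention scores of an all-token LLM pass. It is not the authors' implementation: the function names `pseudo_importance` and `top_k_keep`, the choice of averaging over heads and query positions, and the normalization to a distribution are all assumptions made for illustration, since the abstract does not specify how the attention scores are aggregated.

```python
import numpy as np

def pseudo_importance(attn, visual_idx, query_idx):
    """Hypothetical pseudo-supervision target for each visual token.

    attn:       (heads, seq, seq) attention weights from an all-token pass,
                where attn[h, q, k] is the weight query q puts on key k.
    visual_idx: sequence positions of the visual tokens.
    query_idx:  sequence positions whose attention is used as the signal
                (assumed here to be the non-visual positions).
    """
    # Attention paid by the chosen queries to each visual token: (heads, Q, V)
    sub = attn[:, query_idx, :][:, :, visual_idx]
    # Assumed aggregation: average over heads and query positions -> (V,)
    scores = sub.mean(axis=(0, 1))
    # Normalize so the targets form a distribution over visual tokens
    return scores / max(scores.sum(), 1e-12)

def top_k_keep(scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    return np.sort(np.argsort(-scores, kind="stable")[:k])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(4, 10, 10))
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    targets = pseudo_importance(attn, visual_idx=list(range(6)),
                                query_idx=list(range(6, 10)))
    print(targets.round(3), top_k_keep(targets, 3))
```

In a training loop, targets of this kind would be compared against the outputs of the lightweight importance predictor; the loss used for that comparison is not stated in the abstract and is left out here.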

Paper Structure

This paper contains 18 sections, 9 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of the proposed SToRM framework. (a) SToRM is built on the central idea of leveraging intermediate results from an MLLM -- specifically, attention scores derived from all tokens -- as pseudo-supervision signals for training an importance predictor to reduce visual tokens. (b) In addition, we propose i) a new lightweight importance predictor learned from pseudo-supervision signals; and ii) an anchor-context token merging (ACM) module that reduces visual tokens while preserving essential information. We train SToRM in an E2E manner.
  • Figure 2: The overall architecture of the proposed SToRM framework (at each current time step $t$). The symbols c, WPs, and GT denote a concatenation operator, waypoints over the next $T_+$ frames, and ground truth, respectively. The overall architecture is described at the beginning of the Methods section.
  • Figure 3: The overall architectures of the proposed lightweight importance predictor and ACM module. (a) The proposed importance predictor consists of i) short-term spatio-temporal visual token mixing, ii) channel mixing, and iii) importance score computation. (b) The proposed short-term spatio-temporal visual token mixing mechanism with sliding windows. The input is the entire visual token matrix $\widetilde{\mathbf{Z}}$ in (\ref{eq:token_concat}), where the $\tau$th block of $N$ rows indicates the token-embedding matrix $\widetilde{\mathbf{Z}}_\tau$ at time step $\tau$. A purple shaded block denotes a set of short-term spatio-temporal visual tokens selected by a sliding window, $\widetilde{\mathbf{Z}}_{\mathcal{W}(\tau)}$ in (\ref{eq:window_tokens}); the one-dimensional checkerboard indicates a dilated sliding window. The output of the MLP, $\mathbf{H}_{\tau}$ in (\ref{eq:token_mixing}), represents both spatial structure and short-term temporal evolution at $\tau$. (c) The ACM module comprises i) importance-based token categorization and ii) token merging: we first categorize visual tokens by predicted importance scores from (a), then merge "context" tokens into their most relevant "anchors" via cross-attention.
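
The two mechanisms in the Figure 3 caption can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the paper's implementation: `dilated_window` assumes the window looks backward from $\tau$ and is clipped at time step 0, and `anchor_context_merge` uses plain scaled dot-product cross-attention with a residual connection and no learned projections, whereas the actual module presumably has trained weights. All function names and parameters are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_window(tau, size, dilation=1):
    """Time steps covered by a window ending at tau (assumed backward-looking).

    With dilation > 1, every dilation-th step is taken, which gives the
    checkerboard pattern over time. Steps before 0 are dropped.
    """
    steps = [tau - i * dilation for i in range(size)]
    return sorted(s for s in steps if s >= 0)

def anchor_context_merge(tokens, scores, num_anchors):
    """Keep the top-scoring tokens as anchors and fold the rest into them.

    tokens: (N, D) visual token embeddings.
    scores: (N,) predicted importance scores.
    Returns the merged anchors (num_anchors, D) and their original indices.
    """
    order = np.argsort(-scores, kind="stable")
    anchor_idx = np.sort(order[:num_anchors])    # high importance: anchors
    context_idx = np.sort(order[num_anchors:])   # the rest: context tokens
    anchors, context = tokens[anchor_idx], tokens[context_idx]
    if context.shape[0] == 0:
        return anchors, anchor_idx
    # Cross-attention: anchors are queries, context tokens are keys and values
    attn = softmax(anchors @ context.T / np.sqrt(tokens.shape[1]), axis=-1)
    # Residual update so each anchor keeps its own content
    return anchors + attn @ context, anchor_idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))
    scores = rng.random(16)
    merged, kept = anchor_context_merge(tokens, scores, num_anchors=4)
    print(merged.shape, kept, dilated_window(tau=5, size=3, dilation=2))
```

The point of the residual form is that the LLM receives only `num_anchors` tokens, while information from the discarded context tokens is still mixed into the anchors that attend to them most strongly.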