
Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression

Dingyuan Zhang, Dingkang Liang, Zichang Tan, Xiaoqing Ye, Cheng Zhang, Jingdong Wang, Xiang Bai

TL;DR

This work explores efficient ViT backbones for multi-view 3D detection via token compression and proposes a simple yet effective method, TokenCompression3D (ToC3D), which nearly maintains the performance of recent SOTA methods with up to 30% inference speedup.

Abstract

Slow inference speed is one of the most crucial concerns when deploying multi-view 3D detectors to tasks with high real-time requirements, such as autonomous driving. Although many sparse query-based methods have attempted to improve the efficiency of 3D detectors, they neglect the backbone, especially when Vision Transformers (ViT) are used for better performance. To tackle this problem, we explore efficient ViT backbones for multi-view 3D detection via token compression and propose a simple yet effective method called TokenCompression3D (ToC3D). By leveraging history object queries as high-quality foreground priors, modeling 3D motion information in them, and letting them interact with image tokens through the attention mechanism, ToC3D can effectively estimate the information density of each image token and segment the salient foreground tokens. With the introduced dynamic router design, ToC3D can allocate more computing resources to important foreground tokens while reducing information loss, leading to a more efficient ViT-based multi-view 3D detector. Extensive results on the large-scale nuScenes dataset show that our method nearly maintains the performance of recent SOTA methods with up to 30% inference speedup, and the improvements remain consistent after scaling up the ViT and the input resolution. The code will be made available at https://github.com/DYZhang09/ToC3D.
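
To make the selection idea in the abstract concrete, the following is a minimal sketch of scoring image tokens by the attention they receive from history queries and then keeping the top-scoring fraction. It is an illustration under stated assumptions, not the authors' implementation: the function names `importance_scores` and `split_tokens`, the softmax-over-tokens scoring averaged across queries, and the `keep_ratio` parameter are all hypothetical choices made for this example, and the 3D motion modeling of the queries is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def importance_scores(tokens, queries):
    """Score each image token by the attention it receives from the queries.

    tokens:  list of N token vectors (each a list of d floats)
    queries: list of Nq query vectors (each a list of d floats)
    Assumption: the score is softmax attention over tokens, averaged over queries.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    scores = [0.0] * len(tokens)
    for q in queries:
        logits = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        attn = softmax(logits)  # attention of this query over all tokens
        for i, a in enumerate(attn):
            scores[i] += a / len(queries)  # running average over queries
    return scores


def split_tokens(scores, keep_ratio):
    """Return (salient, redundant) index lists via top-k selection."""
    k = max(1, round(keep_ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k]), sorted(order[k:])
```

Given a list of token vectors and a list of query vectors of the same dimension, `split_tokens(importance_scores(tokens, queries), 0.5)` returns the indices of the half of the tokens that the queries attend to most, followed by the remaining indices.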

Paper Structure (23 sections, 10 equations, 7 figures, 9 tables)


Figures (7)

  • Figure 1: (a) We trim ViTs by focusing on the foreground tokens with the aid of motion cues. (b) Our method reports an ideal trade-off between performance and latency.
  • Figure 2: (a) The overall architecture of ToC3D, which trims each block of ViT backbone through two designs: Motion Query-guided Token Selection strategy (MQTS) and dynamic router. (b) MQTS takes motion queries from history frames as inputs, calculates the importance score, and splits image tokens into salient and redundant tokens. Dynamic router passes these tokens to different paths for efficient feature extraction.
  • Figure 3: Effect of $N_q$ on the nuScenes val set.
  • Figure 4: Effect of keeping ratios on the nuScenes val set.
  • Figure 5: The visualization of our method (better viewed in color). We visualize the attention map in importance score calculation on the left and the salient/redundant tokens after the top-k selection on the right. Redundant tokens are illustrated as translucent.
  • ...and 2 more figures
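
The routing step described in the Figure 2 caption can be sketched as follows. This is a hedged illustration only: `route_tokens`, `heavy_fn`, and `light_fn` are hypothetical names introduced here, and the choice of what each path computes is an assumption left to the caller rather than something taken from the paper.

```python
def route_tokens(tokens, salient_idx, heavy_fn, light_fn):
    """Send salient tokens through an expensive path and the rest through a cheap one.

    tokens:      list of N token vectors
    salient_idx: indices of tokens selected as salient
    heavy_fn:    callable taking the list of salient tokens and returning a list
                 of the same length (stands in for a full ViT block)
    light_fn:    callable applied to each redundant token on its own
                 (stands in for a cheap path, e.g. an identity mapping)
    """
    salient_set = set(salient_idx)
    # The expensive path only ever sees the salient subset.
    heavy_out = heavy_fn([tokens[i] for i in salient_idx])

    out = [None] * len(tokens)
    for i, t in zip(salient_idx, heavy_out):
        out[i] = t
    for i, t in enumerate(tokens):
        if i not in salient_set:
            out[i] = light_fn(t)
    # Tokens are written back to their original positions, so the output
    # keeps the same layout as the input.
    return out
```

Because the expensive path processes only the salient subset, its cost scales with the number of kept tokens rather than with the full token count, which is the source of the savings in this sketch.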