How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?

Tuan Anh Tran, Duy M. H. Nguyen, Hoai-Chau Tran, Michael Barz, Khoa D. Doan, Roger Wattenhofer, Ngo Anh Vien, Mathias Niepert, Daniel Sonntag, Paul Swoboda

TL;DR

This work reveals significant token redundancy in state-of-the-art 3D point cloud transformers, showing that dense tokenization is not strictly necessary for high performance. It introduces gitmerge3D, a globally informed graph token merging approach, and a 3D-aware adaptive merging strategy that can remove up to 90-95% of tokens with minimal accuracy loss. The method yields large reductions in FLOPs and memory, validated across semantic segmentation, reconstruction, and language-guided detection tasks, sometimes even improving efficiency with modest fine-tuning. Overall, the paper advocates a shift from token quantity to token quality in 3D transformers, enabling scalable and deployable 3D foundation architectures, with code and checkpoints publicly released.

Abstract

Recent advances in 3D point cloud transformers have led to state-of-the-art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90-95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over-tokenized and under-optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large-scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io

Paper Structure

This paper contains 28 sections, 13 equations, 13 figures, 12 tables, 1 algorithm.

Figures (13)

  • Figure 1: We compare the original PTv3 and Sonata with our proposed token merging method (PTv3 + Ours and Sonata + Ours) in terms of FLOPs and memory consumption. By merging up to 90% of tokens, our method applied to PTv3 achieves a 5.3× reduction in FLOPs (from 107.5 GFLOPs to 19.9 GFLOPs) and a 6.4× reduction in memory usage (from 10.12 GB to 1.6 GB), with minimal performance degradation. Notably, the model maintains comparable accuracy when fine-tuned by updating only the MLPs before and after the attention layer for just 10% of the original training epochs, while requiring significantly less computation per epoch during fine-tuning.
  • Figure 2: Observation: After merging 90% of the tokens in each attention layer, the change in PCA visualization of feature representation (3rd image) is minimal compared to the original feature (2nd image). Most of the predictions remain unchanged after merging, with red indicating the areas where predictions differ. This leads us to conclude that there is high redundancy in the point cloud processing model.
  • Figure 3: b) For each Point Transformer layer, we compute token energy scores and propagate them to patches using a globally informed graph over the local self-attention. a) These patch-level scores guide adaptive merging, retaining more information for high-energy patches. c) Each patch is divided into evenly sized bins, and destination tokens are randomly selected within these bins to enable spatially aware merging.
  • Figure 4: Off-the-shelf performance comparison of our merging method against existing methods on PTv3 Sonata and PTv3 across three datasets: ScanNet, ScanNet-200, and S3DIS. The numbers above each data point indicate the merging rate.
  • Figure 5: 3D object reconstruction: off-the-shelf performance of MAYC on Objaverse [deitke2023objaverse] and Google Scanned Objects (GSO) [GSO].
  • ...and 8 more figures
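
The bin-based step in the Figure 3 caption (each patch is divided into evenly sized bins and a destination token is chosen at random within each bin) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `merge_patch_by_bins`, the use of contiguous index ranges as bins, and merging by a plain feature average are all assumptions made for illustration; the sketch also assumes the tokens of a patch are already stored in a spatially coherent order.

```python
import numpy as np


def merge_patch_by_bins(features, num_bins, rng):
    """Merge the tokens of one patch down to `num_bins` tokens (illustrative only).

    features: (n_tokens, dim) array of token features for a single patch.
    Returns (merged, assignment, dst_indices):
      merged       (num_bins, dim) averaged features, one row per bin
      assignment   (n_tokens,) bin index each original token was merged into
      dst_indices  (num_bins,) randomly chosen destination token per bin
    """
    n_tokens, dim = features.shape
    if not 1 <= num_bins <= n_tokens:
        raise ValueError("num_bins must be between 1 and the number of tokens")

    # Assumption: bins are contiguous, (nearly) evenly sized index ranges.
    bins = np.array_split(np.arange(n_tokens), num_bins)

    merged = np.empty((num_bins, dim), dtype=np.float64)
    assignment = np.empty(n_tokens, dtype=np.int64)
    dst_indices = np.empty(num_bins, dtype=np.int64)

    for b, idx in enumerate(bins):
        # Pick one destination token at random inside the bin.
        dst_indices[b] = rng.choice(idx)
        # Assumption: every token in the bin is merged by averaging features.
        merged[b] = features[idx].mean(axis=0)
        assignment[idx] = b

    return merged, assignment, dst_indices


# Usage: a 90% merging rate on a 20-token patch keeps 2 tokens.
rng = np.random.default_rng(0)
patch = rng.normal(size=(20, 8))
merged, assignment, dst = merge_patch_by_bins(patch, num_bins=2, rng=rng)

# Broadcasting merged features back to all 20 positions, e.g. for per-point
# prediction (an assumed "unmerge" step, not described in the caption).
restored = merged[assignment]
```

In this sketch the destination index only records which token stands in for the bin (for example, whose coordinates would be kept). An adaptive variant, as the caption suggests, would pick `num_bins` per patch from the patch-level energy score instead of using a fixed value.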