Pointy - A Lightweight Transformer for Point Cloud Foundation Models

Konrad Szafer, Marek Kraft, Dominik Belter

TL;DR

This work introduces a lightweight transformer-based point cloud architecture that is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples, demonstrating the value of a carefully curated training setup and architecture.

Abstract

Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches the state-of-the-art results of models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver results competitive with more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.

Paper Structure (19 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Architecture of Pointy, a transformer backbone for point cloud processing. The model takes raw point cloud data as input and applies patch partitioning using Farthest Point Sampling (FPS) and k-Nearest Neighbors (kNN). In the PointNet-based embedding layer, raw point features are preserved through residual connections alongside the learned patch embeddings. Token merging operations are applied between adjacent tokens after each transformer block. This hierarchical design enables local and global feature learning through progressive patch merging. The final output is a $P \times D$ representation, where $P$ is the number of patches and $D$ is the embedding dimension.
  • Figure 2: ModelNet40
  • Figure 3: ScanObjectNN
  • Figure 4: Objaverse-LVIS
  • Figure 6: Classification accuracy on ModelNet40 as a function of input point cloud size. Models were trained for 30 epochs under identical conditions, with results showing the peak accuracy achieved. While PCT demonstrates superior performance in the 256-1024 point range, our architecture achieves competitive results and attains 89.3% for 2048 points.
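
The patch partitioning described in the Figure 1 caption (FPS to choose patch centers, kNN to gather each patch, then merging of adjacent tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the patch count and size, and in particular the use of pairwise averaging as the merging operator are assumptions made only for illustration, since the caption does not specify them.

```python
import numpy as np

def farthest_point_sampling(points, num_centers, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all chosen centers."""
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = start
    # Squared distance from every point to its nearest chosen center so far.
    min_d2 = np.sum((points - points[start]) ** 2, axis=1)
    for i in range(1, num_centers):
        chosen[i] = np.argmax(min_d2)
        d2 = np.sum((points - points[chosen[i]]) ** 2, axis=1)
        min_d2 = np.minimum(min_d2, d2)
    return chosen

def knn_patches(points, center_idx, k):
    """Gather the k nearest points around each center, giving shape (P, k, 3)."""
    centers = points[center_idx]                                # (P, 3)
    diff = centers[:, None, :] - points[None, :, :]             # (P, N, 3)
    d2 = np.sum(diff ** 2, axis=-1)                             # (P, N)
    nn_idx = np.argsort(d2, axis=1)[:, :k]                      # (P, k)
    return points[nn_idx]

def merge_adjacent_tokens(tokens):
    """Halve the token count by averaging adjacent pairs.

    Assumption: averaging is a stand-in; the source does not state the operator.
    Requires an even number of tokens.
    """
    p, d = tokens.shape
    return tokens.reshape(p // 2, 2, d).mean(axis=1)

# Hypothetical sizes: 1024 input points, 64 patches of 32 points each.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
center_idx = farthest_point_sampling(cloud, num_centers=64)
patches = knn_patches(cloud, center_idx, k=32)                  # (64, 32, 3)

# Flattened patches stand in for learned patch embeddings here.
tokens = patches.reshape(64, -1)                                # (64, 96)
merged = merge_adjacent_tokens(tokens)                          # (32, 96)
```

In a real model the flattened patches would be replaced by learned embeddings, and a merge step would follow each transformer block, so the token count shrinks progressively with depth.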