Pointy - A Lightweight Transformer for Point Cloud Foundation Models

Konrad Szafer; Marek Kraft; Dominik Belter

TL;DR

This work introduces a lightweight transformer-based point cloud architecture trained on only 39k point clouds. It nevertheless outperforms several larger foundation models trained on over 200k samples, demonstrating the value of a carefully curated training setup and architecture.

Abstract

Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. Rather than relying heavily on cross-modal supervision, our model is trained on only 39k point clouds, yet it outperforms several larger foundation models trained on over 200k samples. Our method also approaches the state-of-the-art results of models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing transparent comparisons and highlighting the benefits of our design and of other tokenizer-free architectures. Our results show that simple backbones can deliver results competitive with more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.

Paper Structure

This paper contains 19 sections, 5 figures, 4 tables.

Introduction
Related Work
Pointy
Point Embeddings
Transformer Backbone
Experimental Setup
Datasets and Tasks
ModelNet40 (Wu et al., 2015)
ScanObjectNN (Uy et al., 2019)
Objaverse-LVIS Subset (Deitke et al., 2023)
Baselines
Implementation Details
Results
Classification
Pre-training
...and 4 more sections

Figures (5)

Figure 1: Architecture of Pointy, a transformer backbone for point cloud processing. The model takes raw point cloud data as input and partitions it into patches using Farthest Point Sampling (FPS) and k-Nearest Neighbors (kNN). In the PointNet-based embedding layer, raw point features are preserved through residual connections alongside the learned patch embeddings. After each transformer block, a token merging operation combines adjacent tokens. This hierarchical design enables both local and global feature learning through progressive patch merging. The final output is a $P \times D$ representation, where $P$ is the number of patches and $D$ is the embedding dimension.
Figure 2: ModelNet40
Figure 3: ScanObjectNN
Figure 4: Objaverse-LVIS
Figure 6: Classification accuracy on ModelNet40 as a function of input point cloud size. All models were trained for 30 epochs under identical conditions; the plot reports the peak accuracy each achieved. While PCT performs best in the 256-1024 point range, our architecture remains competitive and reaches 89.3% accuracy at 2048 points.
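
The patch partitioning and token merging described in the Figure 1 caption can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names, the patch count (16), the patch size (8), the use of patch centroids as stand-in tokens, and the averaging merge rule are all assumptions made for illustration. The caption states only that FPS and kNN form the patches and that adjacent tokens are merged after each transformer block.

```python
import math
import random


def farthest_point_sampling(points, num_centers, start=0):
    """Pick `num_centers` indices, each new one farthest from those already chosen."""
    chosen = [start]
    # min_dist[i] = distance from point i to its nearest chosen center so far
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centers:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def knn_patches(points, center_ids, k):
    """For every center, return the indices of its k nearest points (center included)."""
    patches = []
    for c in center_ids:
        order = sorted(range(len(points)), key=lambda i: math.dist(points[i], points[c]))
        patches.append(order[:k])
    return patches


def merge_adjacent(tokens):
    """Halve the token count by averaging neighbouring pairs.

    Averaging is an assumed merge rule; a trailing unpaired token is dropped.
    """
    return [
        [(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])]
        for i in range(0, len(tokens) - 1, 2)
    ]


random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(256)]
centers = farthest_point_sampling(cloud, num_centers=16)
patches = knn_patches(cloud, centers, k=8)
# Stand-in tokens: patch centroids instead of learned PointNet embeddings.
tokens = [[sum(cloud[i][d] for i in patch) / len(patch) for d in range(3)] for patch in patches]
merged = merge_adjacent(tokens)
print(len(patches), len(patches[0]), len(merged))  # 16 8 8
```

Each call to `merge_adjacent` halves the number of tokens, so applying it after successive blocks gives the progressively coarser hierarchy that the caption describes. In a real model the centroid tokens would be replaced by learned embeddings of dimension D.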
