3D Shape Tokenization via Latent Flow Matching
Jen-Hao Rick Chang, Yuyang Wang, Miguel Angel Bautista Martin, Jiatao Gu, Xiaoming Zhao, Josh Susskind, Oncel Tuzel
TL;DR
This work introduces Shape Tokens (ST), a latent 3D representation that treats shapes as densities $p(\mathbf{x})$ in $\mathbb{R}^3$ concentrated on surfaces and learns them via latent flow matching. Each shape is encoded into a compact set $s$ of 1,024 tokens of 16 dimensions each, giving a continuous, scalable tokenizer that requires only point clouds and minimal preprocessing. The authors establish geometric connections through the flow, including zero-shot surface normal estimation, UVW-like mappings, and exact log-likelihoods for sampling, and demonstrate ST’s versatility across 3D-CLIP alignment, unconditional and image-conditioned 3D generation, and neural rendering. Empirical results on ShapeNet, Objaverse, and GSO show performance competitive with baselines while improving data efficiency and scalability, highlighting ST’s potential as a flexible, ML-friendly 3D representation for diverse downstream tasks.
Abstract
We introduce a latent 3D representation that models 3D surfaces as probability density functions in 3D, i.e., $p(x, y, z)$, using flow matching. Our representation is specifically designed for consumption by machine learning models, offering continuity and compactness by construction while requiring only point clouds and minimal data preprocessing. Despite being a data-driven method, our use of flow matching in 3D space enables interesting geometric properties, including zero-shot estimation of surface normals and deformation fields. We evaluate on several machine learning tasks, including 3D-CLIP, unconditional generative models, single-image-conditioned generative models, and intersection-point estimation. Across all experiments, our models achieve performance competitive with existing baselines, while requiring less preprocessing and auxiliary information from the training data.
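To make the idea of learning a surface density with flow matching more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the common linear interpolation path between Gaussian noise and surface points; the exact path, network, and training details are not stated in this section. The names `flow_matching_batch`, `flow_matching_loss`, `sample_points`, and `toy_velocity` are hypothetical, and the toy "shape" is a single point so that the exact velocity field is known in closed form.

```python
import numpy as np

def flow_matching_batch(surface_points, rng):
    """Build one flow-matching training batch (assumed linear path).

    x0 ~ N(0, I) is noise, x1 is a point sampled on the surface,
    x_t = (1 - t) * x0 + t * x1, and the regression target is x1 - x0.
    """
    n = surface_points.shape[0]
    x0 = rng.standard_normal((n, 3))
    t = rng.uniform(0.0, 1.0, size=(n, 1))
    x_t = (1.0 - t) * x0 + t * surface_points
    target_v = surface_points - x0
    return x_t, t, target_v

def flow_matching_loss(velocity_fn, tokens, x_t, t, target_v):
    """Mean squared error between predicted and target velocities."""
    pred = velocity_fn(x_t, t, tokens)
    return float(np.mean(np.sum((pred - target_v) ** 2, axis=-1)))

def sample_points(velocity_fn, tokens, n_points, n_steps, rng):
    """Draw points from the modeled density by Euler-integrating the flow."""
    x = rng.standard_normal((n_points, 3))
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full((n_points, 1), k * dt)
        x = x + dt * velocity_fn(x, t, tokens)
    return x

# Toy setup (assumption): the "surface" is the single point c, for which the
# exact velocity field under the linear path is (c - x) / (1 - t).
c = np.array([0.5, -1.0, 2.0])

def toy_velocity(x, t, tokens):
    # A real model would be a network conditioned on `tokens`;
    # this stand-in ignores them.
    return (c - x) / (1.0 - t)

rng = np.random.default_rng(0)
tokens = np.zeros((1024, 16))   # 1,024 tokens of 16 dimensions, as in the TL;DR
surface = np.tile(c, (256, 1))  # 256 copies of the single surface point

x_t, t, target_v = flow_matching_batch(surface, rng)
loss = flow_matching_loss(toy_velocity, tokens, x_t, t, target_v)
samples = sample_points(toy_velocity, tokens, n_points=64, n_steps=10, rng=rng)
```

In this toy case the exact velocity field gives a loss of essentially zero and the sampled points land on `c`. In the setting described above, `velocity_fn` would instead be a learned network conditioned on the shape tokens, and sampling would yield points distributed over the encoded surface.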
