3D Shape Tokenization via Latent Flow Matching

Jen-Hao Rick Chang; Yuyang Wang; Miguel Angel Bautista Martin; Jiatao Gu; Xiaoming Zhao; Josh Susskind; Oncel Tuzel

TL;DR

This work introduces Shape Tokens (ST), a latent 3D representation that treats each shape as a probability density $p(\mathbf{x})$ in $\mathbb{R}^3$ concentrated on its surface and learns that density via latent flow matching. Each shape is encoded into a compact set $s$ of 1,024 tokens, each 16-dimensional, yielding a continuous, scalable tokenizer that requires only point clouds and minimal preprocessing. The authors derive geometric properties from the flow, including zero-shot surface normal estimation, UVW-like mappings, and exact log-likelihood evaluation, and demonstrate ST’s versatility across 3D-CLIP alignment, unconditional and image-conditioned 3D generation, and neural rendering. Empirical results on ShapeNet, Objaverse, and GSO show performance competitive with baselines while improving data efficiency and scalability, highlighting ST’s potential as a flexible, ML-friendly 3D representation for diverse downstream tasks.

Abstract

We introduce a latent 3D representation that models 3D surfaces as probability density functions in 3D, i.e., p(x,y,z), using flow matching. Our representation is specifically designed for consumption by machine learning models, offering continuity and compactness by construction while requiring only point clouds and minimal data preprocessing. Although the method is data-driven, our use of flow matching in 3D space gives it useful geometric properties, including zero-shot estimation of surface normals and deformation fields. We evaluate on several machine learning tasks, including 3D-CLIP, unconditional generative models, single-image-conditioned generative models, and intersection-point estimation. Across all experiments, our models achieve performance competitive with existing baselines while requiring less preprocessing and less auxiliary information from the training data.
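To make the idea of learning a surface density with flow matching more concrete, here is a minimal sketch of how training targets could be built for one toy shape. It assumes a linear interpolation path between Gaussian noise and surface points, which is a common flow-matching choice and not necessarily the exact formulation used in the paper. The names `sample_sphere_surface`, `flow_matching_batch`, and `flow_matching_loss` are hypothetical, and the conditioning of the velocity field on the shape tokens is omitted for brevity.

```python
import numpy as np

def sample_sphere_surface(n, rng):
    """Draw n points uniformly from the unit sphere surface (a toy 'shape')."""
    v = rng.standard_normal((n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def flow_matching_batch(x1, rng):
    """Build one batch for a linear-path flow matching objective (assumed path).

    x1: (n, 3) points sampled on the surface.
    Returns the interpolated points x_t, the times t, and the target velocity.
    """
    x0 = rng.standard_normal(x1.shape)       # noise samples, one per surface point
    t = rng.uniform(size=(x1.shape[0], 1))   # one time in [0, 1) per point
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight path at time t
    target = x1 - x0                         # constant velocity along that path
    return x_t, t, target

def flow_matching_loss(velocity_fn, x_t, t, target):
    """Mean squared error between predicted and target velocities."""
    pred = velocity_fn(x_t, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

rng = np.random.default_rng(0)
x1 = sample_sphere_surface(256, rng)
x_t, t, target = flow_matching_batch(x1, rng)

# A placeholder velocity field that always predicts zero; a real model would be
# a neural network that also receives the shape tokens as conditioning.
zero_loss = flow_matching_loss(lambda x, t: np.zeros_like(x), x_t, t, target)
```

In a full system, `velocity_fn` would be replaced by a learned network, and integrating that network's velocity field from noise to time 1 would yield samples concentrated on the surface.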
