3D Shape Tokenization via Latent Flow Matching

Jen-Hao Rick Chang, Yuyang Wang, Miguel Angel Bautista Martin, Jiatao Gu, Xiaoming Zhao, Josh Susskind, Oncel Tuzel

TL;DR

This work introduces Shape Tokens (ST), a latent 3D representation that treats shapes as densities $p(\mathbf{x})$ in $\mathbb{R}^3$ concentrated on surfaces and learns them via latent flow matching. Each shape is encoded into a compact set $s$ of 1,024 tokens with 16 dimensions, enabling a continuous, scalable tokenizer that requires only point clouds and minimal preprocessing. The authors establish geometric connections through the flow, including zero-shot surface normal estimation, UVW-like mappings, and exact log-likelihoods for sampling, and demonstrate ST’s versatility across 3D-CLIP alignment, unconditional and image-conditioned 3D generation, and neural rendering. Empirical results on ShapeNet, Objaverse, and GSO show competitive performance with baselines while improving data efficiency and scalability, highlighting ST’s potential as a flexible, ML-friendly 3D representation for diverse downstream tasks.

Abstract

We introduce a latent 3D representation that models 3D surfaces as probability density functions in 3D, i.e., p(x,y,z), with flow matching. Our representation is specifically designed for consumption by machine learning models, offering continuity and compactness by construction while requiring only point clouds and minimal data preprocessing. Despite being a data-driven method, our use of flow matching in 3D space enables interesting geometric properties, including zero-shot estimation of surface normals and deformation fields. We evaluate on several machine learning tasks, including 3D-CLIP, unconditional generative models, a single-image-conditioned generative model, and intersection-point estimation. Across all experiments, our models achieve performance competitive with existing baselines while requiring less preprocessing and auxiliary information from training data.
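
The flow-matching training signal mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common linear-interpolation path between Gaussian noise and surface points, which this page does not confirm as the authors' exact formulation. The names `flow_matching_loss`, `velocity_fn`, and `zero_velocity`, and the time convention (noise at t=0, data at t=1), are hypothetical and chosen for illustration only. The token shape (1,024 × 16) is taken from the TL;DR.

```python
import numpy as np


def flow_matching_loss(velocity_fn, surface_points, shape_tokens, rng):
    """Monte Carlo estimate of a conditional flow-matching loss for one shape.

    ASSUMPTION: linear path x_t = (1 - t) * noise + t * data, whose target
    velocity is (data - noise). This is a generic formulation, not
    necessarily the one used by the authors.
    """
    num_points = surface_points.shape[0]
    noise = rng.standard_normal(surface_points.shape)   # one noise sample per point
    t = rng.uniform(size=(num_points, 1))                # one time in [0, 1) per point
    x_t = (1.0 - t) * noise + t * surface_points         # point on the straight path
    target_velocity = surface_points - noise             # time derivative of x_t
    predicted = velocity_fn(x_t, t, shape_tokens)        # conditioned on the tokens
    return float(np.mean(np.sum((predicted - target_velocity) ** 2, axis=-1)))


def zero_velocity(x_t, t, shape_tokens):
    """Hypothetical stand-in for a learned network: always predicts zero."""
    return np.zeros_like(x_t)


rng = np.random.default_rng(0)
# Toy "surface": points on a unit sphere (xyz only, no normals).
surface_points = rng.standard_normal((2048, 3))
surface_points /= np.linalg.norm(surface_points, axis=-1, keepdims=True)
# Placeholder latent: 1,024 tokens of 16 dimensions, as stated in the TL;DR.
shape_tokens = np.zeros((1024, 16))
loss = flow_matching_loss(zero_velocity, surface_points, shape_tokens, rng)
```

In a real system, `velocity_fn` would be a trained network and `shape_tokens` would come from the tokenizer's encoder. Here both are placeholders, so the loss value only shows that the objective is well defined for point-cloud input.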

Paper Structure

This paper contains 43 sections, 5 equations, 20 figures, 11 tables.

Figures (20)

  • Figure 1: Shape Tokens can be readily used as input / target to machine learning models in various applications, including single-image-to-3D (left), neural rendering of normal maps (top right) and 3D-CLIP alignment (bottom right). Mesh credits whaarquitectos_librero_repisasrodiergabrielle_early_morningmadexc_domik_housebennettgraham_braceletandresblancof_gohome1.
  • Figure 2: Overview of our architecture. (Left) We model a 3D shape as a probability density function that is concentrated on the surface, forming a delta function in 3D. (Center) Flow matching is used to learn both $p(xyz | s)$ and the shape tokenizer. (Right) The figure shows the total latent dimension and reconstruction error of various methods trained on the ShapeNet dataset. Our tokenizers achieve a better trade-off between compactness and reconstruction quality than the baselines.
  • Figure 3: The ODE integration trajectory maps xyz (data) to uvw (noise). Mesh credits downs2022google.
  • Figure 4: Reconstruction, densification, and normal estimation of unseen point clouds in the GSO dataset. For each row, given a point cloud containing 16,384 points (xyz only), we compute ST and draw 262,144 i.i.d. samples from the resulting $p(x|s)$. The columns render the input and the sampled point clouds from different viewpoints. As indicated by the labels in parentheses, we color the input points according to their xyz coordinates and the sampled points according to their initial noise's uvw coordinates and their estimated normals (last two columns). Note that we do not provide normals as input to the shape tokenizer. Mesh credits downs2022google.
  • Figure 5: Single-image to 3D point cloud results on unseen meshes in Objaverse. We color the points with RGB color that indicates the original location of the point in the initial noise space. Mesh credits fedomo_ru_svetilnik_3885_25lasketchingsushi_cali_gardenbinkley_spacetrucker_galactic_truckstop_restroomsfedomo_ru_lyustra_2054_10panyaachan_red_glowing_mushroommartinice_group_op220667.
  • ...and 15 more figures
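
The captions of Figures 3 and 4 describe an ODE trajectory that links noise coordinates (uvw) and surface coordinates (xyz). The sketch below shows the mechanics with plain Euler integration. The integrator `integrate_flow` and the constant field `toy_velocity` are hypothetical stand-ins for illustration: the learned, token-conditioned velocity network is not reproduced here, and the time convention (noise at t=0, data at t=1) is an assumption.

```python
import numpy as np


def integrate_flow(velocity_fn, points, t_start, t_end, num_steps):
    """Euler-integrate dx/dt = velocity_fn(x, t) from t_start to t_end.

    Passing t_start > t_end integrates backwards in time.
    """
    x = np.array(points, dtype=float)
    dt = (t_end - t_start) / num_steps   # negative when integrating backwards
    t = t_start
    for _ in range(num_steps):
        x = x + dt * velocity_fn(x, t)
        t += dt
    return x


# ASSUMPTION: a constant velocity field replaces the learned network so the
# example stays self-contained and exactly invertible under Euler steps.
shift = np.array([0.5, -1.0, 2.0])


def toy_velocity(x, t):
    return np.broadcast_to(shift, x.shape)


rng = np.random.default_rng(0)
uvw = rng.standard_normal((1000, 3))                        # initial noise samples
xyz = integrate_flow(toy_velocity, uvw, 0.0, 1.0, 50)       # noise -> data
uvw_back = integrate_flow(toy_velocity, xyz, 1.0, 0.0, 50)  # data -> noise
```

Because each sampled `xyz` point keeps a one-to-one pairing with its starting `uvw` value, the samples can be colored by their noise coordinates, which is the kind of visualization the Figure 4 and Figure 5 captions describe.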