Hierarchical Transformers for Unsupervised 3D Shape Abstraction

Aditya Vora, Lily Goli, Andrea Tagliasacchi, Hao Zhang

TL;DR

HiT tackles unsupervised hierarchical 3D shape abstraction by learning multi-level part decompositions across diverse categories using a hierarchical transformer with per-level codebooks and cross-attention to establish soft parent–child relations. Each part is grounded as a 3D convex primitive, with containment constraints and a reconstruction-based objective that includes convex regularization and tree-balancing terms. The approach yields coherent, coarse-to-fine shape representations and achieves state-of-the-art unsupervised part segmentation on ShapeNet/PartNet, while enabling cross-category hierarchies without labels. This scalable, interpretable hierarchy supports improved shape editing, manipulation, and analysis, with potential extensions to adaptive hierarchies and generative modeling.
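The idea of grounding each part as a convex primitive with a containment constraint can be illustrated with a small NumPy sketch. It rests on assumptions that are not taken from the paper: a convex is modeled as an intersection of half-spaces, and containment is scored by how far sampled child points protrude from the parent. The helper names (`axis_aligned_box`, `box_corners`, `containment_violation`) are hypothetical.

```python
import itertools
import numpy as np

def convex_signed_value(points, normals, offsets):
    """Max over half-spaces of (n . x - d); a value <= 0 means the point is inside."""
    # points: (P, 3), normals: (H, 3), offsets: (H,)
    return (points @ normals.T - offsets).max(axis=1)

def axis_aligned_box(center, half_size):
    """Hypothetical helper: express a box as six half-spaces n . x <= d."""
    center = np.asarray(center, dtype=float)
    half = np.asarray(half_size, dtype=float)
    normals = np.vstack([np.eye(3), -np.eye(3)])
    offsets = np.concatenate([center + half, half - center])
    return normals, offsets

def box_corners(center, half_size):
    """Hypothetical helper: the eight corners of a box, used as sample points."""
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
    return np.asarray(center, dtype=float) + signs * np.asarray(half_size, dtype=float)

def containment_violation(child_points, parent_normals, parent_offsets):
    """Mean distance-like amount by which child points stick out of the parent."""
    values = convex_signed_value(child_points, parent_normals, parent_offsets)
    return np.maximum(values, 0.0).mean()

# Parent: unit-half-size box at the origin (assumed toy geometry).
parent_n, parent_d = axis_aligned_box([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

# A child fully inside the parent, and one that straddles the parent's +x face.
inside_child = box_corners([0.5, 0.0, 0.0], [0.25, 0.25, 0.25])
straddling_child = box_corners([1.0, 0.0, 0.0], [0.25, 0.25, 0.25])

inside_penalty = containment_violation(inside_child, parent_n, parent_d)
straddling_penalty = containment_violation(straddling_child, parent_n, parent_d)
```

In this toy setup the penalty is zero when every child sample lies inside the parent and grows as the child protrudes, which is the behavior a containment term would need. The actual primitive parameterization and loss used by HiT may differ.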

Abstract

We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and to represent more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.
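A minimal NumPy sketch of how soft parent-child relations could arise from cross-attention between two levels is given below. It is illustrative only. The choice of child features as queries and parent features as keys, the scaled dot-product form, and the sizes (4 parents, 8 children, 16 dimensions) are assumptions rather than details confirmed by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_parent_assignment(child_queries, parent_keys):
    """Row i is a probability distribution over parents for child node i."""
    dim = child_queries.shape[-1]
    logits = child_queries @ parent_keys.T / np.sqrt(dim)
    return softmax(logits, axis=-1)

def children_per_parent(assignment):
    """Harden the soft assignment and count the children attached to each parent."""
    hard_parent = assignment.argmax(axis=1)
    return np.bincount(hard_parent, minlength=assignment.shape[1])

rng = np.random.default_rng(0)
parent_keys = rng.normal(size=(4, 16))    # assumed: 4 nodes at the coarser level
child_queries = rng.normal(size=(8, 16))  # assumed: 8 nodes at the finer level

assignment = soft_parent_assignment(child_queries, parent_keys)
counts = children_per_parent(assignment)
```

Only the number of nodes per level is fixed here. How many children each parent receives is decided by the attention weights, so the resulting tree need not be binary or balanced.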

Paper Structure

This paper contains 27 sections, 11 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: We propose a hierarchical transformer that learns part codebooks at each level, representing shapes from coarse to fine when trained across shapes. Cross-attention "connects" levels, establishing learnable part–subpart relationships. The decoded parts are mapped to 3D convex primitives that provide geometric explanations. An example decomposition of a lamp is shown across three levels, from a coarse base and shade to finer structural details.
  • Figure 2: We outperform all baselines in the part segmentation task on ShapeNet, both qualitatively and quantitatively (IoU $\uparrow$). Our dynamic tree structure adapts to geometry variations within a category (e.g., chairs), discovering a varying number of parts, while fixed-tree baselines fail to capture such differences.
  • Figure 3: Qualitative and quantitative (IoU $\uparrow$) results on the ShapeNet dataset show that our method achieves improved part segmentation by accurately reconstructing and consistently recovering recurring parts, whereas baselines often misclassify or miss them entirely.
  • Figure 4: Our hierarchical part segmentation and reconstruction method produces a coherent multi-level shape abstraction: higher levels represent the main structural components, while finer levels capture detailed sub-parts, yielding more accurate reconstructions than prior approaches. The color maps show the parent-child relationships between levels.
  • Figure 5: t-SNE visualization of subpart features $\mathbf{Z}^{(\ell)}$ across the ShapeNet test set shows that embeddings for each part (color-coded) form coherent clusters in the embedding space.
  • ...and 6 more figures