TORA: Topological Representation Alignment for 3D Shape Assembly

Nahyuk Lee, Zhiang Chen, Marc Pollefeys, Sunghwan Hong

Abstract

Flow-matching methods for 3D shape assembly learn point-wise velocity fields that transport parts toward assembled configurations, yet they receive no explicit guidance about which cross-part interactions should drive the motion. We introduce TORA, a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. We first realize this via a simple instantiation, token-wise cosine matching, which injects the teacher representation's learned geometric descriptors. We then extend this with a Centered Kernel Alignment (CKA) loss that matches the similarity structure between student and teacher representations for stronger topological alignment. Through systematic probing of diverse 3D encoders, we show that geometry- and contact-centric teacher properties, not semantic classification ability, govern alignment effectiveness, and that alignment is most beneficial at later transformer layers where spatial structure naturally emerges. TORA introduces zero inference overhead while yielding two consistent benefits: faster convergence (up to 6.9$\times$) and improved in-distribution accuracy, along with greater robustness under domain shift. Experiments on five benchmarks spanning geometric, semantic, and inter-object assembly demonstrate state-of-the-art performance, with particularly pronounced gains in zero-shot transfer to unseen real-world and synthetic datasets. Project page: https://nahyuklee.github.io/tora.
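
As a rough illustration of the token-wise cosine matching described above, the following PyTorch-style sketch aligns projected student tokens from the flow-matching backbone with frozen teacher tokens. The function name, tensor shapes, and projection head are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(student_tokens: torch.Tensor,
                          teacher_tokens: torch.Tensor,
                          projector: torch.nn.Module) -> torch.Tensor:
    """Token-wise cosine-distance alignment between student and frozen teacher tokens.

    student_tokens: (B, N, D_s) intermediate tokens from the flow-matching backbone.
    teacher_tokens: (B, N, D_t) tokens from the frozen pretrained 3D encoder.
    projector:      hypothetical head mapping student features to the teacher dimension.
    """
    teacher_tokens = teacher_tokens.detach()             # teacher only provides targets
    projected = projector(student_tokens)                # (B, N, D_t)
    cos = F.cosine_similarity(projected, teacher_tokens, dim=-1)  # (B, N)
    return (1.0 - cos).mean()                            # mean cosine distance over tokens
```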

Figures (20)

  • Figure 1: Multi-part assembly results across regimes. We compare the RPF baseline with our alignment variants and show that casting training as teacher-student distillation, injecting pretrained geometric priors, consistently improves performance.
  • Figure 2: Overview of the Topological Representation Alignment (TORA) framework. TORA distills relational geometric structure from a frozen 3D foundation teacher into a flow-matching student during training. By matching Gram-based similarity matrices via Centered Kernel Alignment (CKA), the student learns the pairwise "who-is-similar-to-whom" relational topology of parts. As detailed in Sec. \ref{sec:experiments}, this structural distillation significantly accelerates convergence and enhances robustness under domain shift, while incurring strictly zero overhead during inference.
  • Figure 3: Conceptual illustration of alignment objectives. Blue and red dots denote student tokens $\hat{\mathbf{h}}$ and teacher tokens $\mathbf{y}$, respectively. NT-Xent enforces per-point discriminability via positive/negative pairing, and cosine distance independently aligns each token pair. The CKA objective instead matches the pairwise similarity structures (Gram matrices $\tilde{\mathbf{G}}_S$, $\tilde{\mathbf{G}}_T$), preserving relational topology rather than individual feature vectors; a minimal code sketch of this objective follows the figure list.
  • Figure 4: Correlation Analysis of Teacher Representations. We analyze the relationship between representation properties and assembly performance. (a-b) Global semantic understanding (object classification) correlates much more weakly with final shape assembly accuracy than spatial structure awareness (mating surface segmentation). (c-d) Local spatial structure metrics such as LDS struggle to identify performant teachers, whereas measures that highlight particular geometric properties show much clearer trends in assembly performance. Overall, geometry- and contact-centric teacher properties are more indicative of downstream assembly quality, motivating our structural distillation objective.
  • Figure 5: Impact of different teachers on distillation. We compare the Part Accuracy of TORA under $\mathcal{L}_{\text{cos-dist}}$ and $\mathcal{L}_{\text{CKA}}$ for various 3D foundation models as teachers on the Breaking Bad dataset [sellan2022breaking]. The dashed line indicates the RPF baseline [sun2025rectified].
  • ...and 15 more figures
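
As a rough companion to Figures 2 and 3, the sketch below shows one way to compute a linear CKA alignment loss from centered Gram matrices of student and teacher tokens. The PyTorch framing, helper names, and tensor shapes are illustrative assumptions rather than the authors' code.

```python
import torch

def centered_gram(tokens: torch.Tensor) -> torch.Tensor:
    """Centered Gram matrix of token features with shape (N, D)."""
    gram = tokens @ tokens.t()                                # (N, N) pairwise similarities
    n = gram.size(0)
    centering = torch.eye(n, device=tokens.device) - 1.0 / n  # H = I - (1/n) 11^T
    return centering @ gram @ centering

def cka_alignment_loss(student_tokens: torch.Tensor,
                       teacher_tokens: torch.Tensor) -> torch.Tensor:
    """1 - linear CKA between student (N, D_s) and frozen teacher (N, D_t) tokens."""
    gs = centered_gram(student_tokens)
    gt = centered_gram(teacher_tokens.detach())   # teacher stays frozen
    hsic = (gs * gt).sum()                        # <Gs, Gt>_F, an unnormalized HSIC estimate
    cka = hsic / (gs.norm() * gt.norm() + 1e-8)   # normalize by Frobenius norms
    return 1.0 - cka
```

Because CKA compares normalized Gram matrices, such a loss constrains only the pairwise similarity structure ("who is similar to whom") rather than individual token embeddings, which matches the relational-topology view illustrated in Figure 3.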