SPAFormer: Sequential 3D Part Assembly with Transformers

Boshen Xu, Sipeng Zheng, Qin Jin

TL;DR

SPAFormer tackles the combinatorial explosion in 3D part assembly by conditioning on assembly sequences as weak constraints and employing a transformer-based architecture. It introduces knowledge-enhancement encodings (order, relation, symmetry) and two generator variants (parallel and autoregressive) to predict part poses from 3D point clouds. The approach is validated on PartNet-Assembly, a 21-category benchmark, where it shows superior generalization, especially in multi-task and long-horizon assembly, and performs competitively with visually conditioned methods. Qualitative analyses and a demonstration on the real-world Redwood dataset underscore its practical potential, and the paper outlines directions for future work.

Abstract

We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task. This task requires accurate prediction of each part's pose in sequential steps. As the number of parts increases, the number of possible assembly combinations grows exponentially, leading to a combinatorial explosion that severely hinders the efficacy of 3D-PA. SPAFormer addresses this problem by leveraging weak constraints from assembly sequences, effectively reducing the complexity of the solution space. Since a sequence of parts conveys construction rules much as a sentence is structured through words, our model explores both parallel and autoregressive generation. We further strengthen SPAFormer through knowledge enhancement strategies that utilize the attributes of parts and their sequence information, enabling it to capture the inherent assembly patterns and relationships among sequentially ordered parts. We also construct a more challenging benchmark named PartNet-Assembly, covering 21 varied categories, to validate the effectiveness of SPAFormer more comprehensively. Extensive experiments demonstrate the superior generalization capabilities of SPAFormer, particularly in multi-task settings and in scenarios requiring long-horizon assembly. Code is available at https://github.com/xuboshen/SPAFormer.
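
To make the scale of the combinatorial explosion concrete, the following is a minimal sketch under a toy assumption that is not taken from the paper's method: each of `num_parts` parts may occupy one of `num_positions` discrete positions. The function names are hypothetical, and the extra factorial factor for an unknown placement order is an illustrative assumption rather than the authors' own counting.

```python
from math import factorial


def pose_configurations(num_parts: int, num_positions: int) -> int:
    """Count joint placements when every part independently picks one position."""
    return num_positions ** num_parts


def configurations_with_free_order(num_parts: int, num_positions: int) -> int:
    """Toy count when the placement order must be searched as well.

    Assumption for illustration only: every permutation of the parts is a
    distinct candidate sequence, which multiplies the count by n!.
    """
    return factorial(num_parts) * pose_configurations(num_parts, num_positions)


# Print how quickly both counts grow as the number of parts increases.
for n in (2, 5, 10, 20):
    fixed = pose_configurations(n, 10)
    free = configurations_with_free_order(n, 10)
    print(f"parts={n:2d}  fixed order: {fixed:.3e}  free order: {free:.3e}")
```

In this toy model, supplying the assembly sequence removes the factorial factor, which gives one intuition for why a known sequence can act as a weak constraint on the search space.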

Paper Structure

This paper contains 17 sections, 9 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: We propose SPAFormer, a novel sequence-conditioned, transformer-based method to assemble objects from given parts represented as 3D point clouds. SPAFormer efficiently leverages part geometry and sequence information, achieving significantly more plausible assemblies on our constructed benchmark PartNet-Assembly than other baseline methods, including ScorePA [cheng2023scorepa], DGL [HuangZhan2020DGL], and RGL [narayan2022rgl].
  • Figure 2: Illustration of the combinatorial explosion challenge inherent in the assembly process. (a) For an object composed of $n$ parts, where each part is assumed to occupy one of $m$ discrete positions, the number of potential combinations grows at an extraordinary rate, exceeding $O(m^n)$ in complexity. (b) The number of constituent parts increases as the target object becomes more complex.
  • Figure 3: Illustration of the overall end-to-end framework of SPAFormer. (a) A shared 3D backbone extracts the geometry feature of each individual part. (b) Knowledge enhancement then incorporates symmetry, order, and relation information into the part features through positional encodings. Poses are generated by either (c1) a parallel generator, which generates the poses of all parts at once, or (c2) an autoregressive generator, which decodes part poses step by step according to the assembly sequence.
  • Figure 4: Visualizations of assembly results when enhancing knowledge by adding new encoding patterns in a stepwise way.
  • Figure 5: Comparison across varied assembly lengths. Our model shows notable improvements, particularly in long-horizon assembly ($>$10 parts), when it is enhanced with OEnc and REnc (+C) as well as SEnc (+S).
  • ...and 5 more figures
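
The difference between the parallel and autoregressive generators described in the Figure 3 caption can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: mean pooling stands in for transformer attention, `toy_pose_head` is a hypothetical placeholder for a learned pose head, the sinusoidal form of the order encoding is an assumption, and so is the 7-dimensional pose layout (3 translation values plus a unit quaternion).

```python
import math
import random
from typing import List

Vector = List[float]


def order_encoding(step: int, dim: int) -> Vector:
    """Sinusoidal encoding of an assembly step index (assumed form)."""
    enc = []
    for i in range(dim):
        angle = step / (10000 ** ((2 * (i // 2)) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Stand-in for attention: average the given feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def toy_pose_head(context: Vector) -> Vector:
    """Hypothetical pose head: 3 translation values + a normalized quaternion."""
    translation, quat = context[:3], context[3:7]
    norm = math.sqrt(sum(v * v for v in quat))
    if norm == 0.0:
        return translation + [1.0, 0.0, 0.0, 0.0]  # fall back to identity
    return translation + [v / norm for v in quat]


def enhance(features: List[Vector]) -> List[Vector]:
    """Add the order encoding of each step to the matching part feature."""
    dim = len(features[0])
    return [add(f, order_encoding(i, dim)) for i, f in enumerate(features)]


def parallel_generate(features: List[Vector]) -> List[Vector]:
    """All poses come from one pass; no pose depends on another predicted pose."""
    enhanced = enhance(features)
    global_context = mean_pool(enhanced)
    return [toy_pose_head(add(f, global_context)) for f in enhanced]


def autoregressive_generate(features: List[Vector]) -> List[Vector]:
    """Poses are decoded step by step in sequence order."""
    enhanced = enhance(features)
    dim = len(enhanced[0])
    poses: List[Vector] = []
    for step, feature in enumerate(enhanced):
        # Only parts up to the current step are visible (causal context).
        context = mean_pool(enhanced[: step + 1])
        if poses:
            # Feed back the previous prediction, zero-padded to the feature size.
            previous = poses[-1] + [0.0] * (dim - len(poses[-1]))
            context = add(context, previous)
        poses.append(toy_pose_head(add(feature, context)))
    return poses


# Usage: five parts with random 8-dimensional features, listed in assembly order.
rng = random.Random(0)
part_features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
print(len(parallel_generate(part_features)), "poses from the parallel sketch")
print(len(autoregressive_generate(part_features)), "poses from the autoregressive sketch")
```

The sketch expects feature vectors of at least 7 dimensions. The point it illustrates is structural: the parallel variant sees every part at once, while the autoregressive variant conditions each step on the parts and poses that precede it in the sequence.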