Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

Jiahao Zhang, Anoop Cherian, Cristian Rodriguez, Weijian Deng, Stephen Gould

TL;DR

This work addresses automatic 3D furniture assembly guided by diagrammatic manuals, framing it as a discrete-continuous optimization problem. It introduces Manual-PA, a transformer-based framework that first aligns 3D parts with step diagrams using contrastive learning and Hungarian-based permutation, then predicts the 6DoF poses through a cross-attentive decoder guided by the learned order. The approach achieves state-of-the-art results on PartNet and demonstrates strong generalization to real-world IKEA manuals, including zero-shot transfer. Ablation and visualization analyses show that explicit order guidance and cross-modal alignment are crucial for accurate assembly, enabling practical, diagram-guided 3D assembly systems.

Abstract

Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space, which makes learning to assemble challenging for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals, using a contrastive learning backbone to predict the assembly order, and infers the 6D pose of each part by relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts leads to significant improvements in assembly performance over the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly on the IKEA-Manual dataset.

Paper Structure

This paper contains 31 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An illustration of the manual-guided 3D part assembly task. Given (a) a diagrammatic manual demonstrating the step-by-step assembly process and (b) a set of texture-less furniture parts, the goal is to (c) infer the order of parts for the assembly from the manual sequence and predict the 6DoF pose for each part such that the spatially transformed parts assemble into the furniture described in the manual.
  • Figure 2: Overview of our proposed method Manual-PA. (a) Feature extraction: we extract semantic and geometrical features from both the step diagrams of the assembly manual and the corresponding part point clouds using the image encoder and point encoder, respectively. (b) Manual-guided part permutation learning: we compute a similarity matrix $\mathbf{S}$ between the two modalities, and subsequently apply the Hungarian algorithm to obtain the permutation matrix $\mathbf{P}$ for the parts. (c) Manual-guided part pose estimation: we add positional encodings (PE) $\Phi$ to both the step diagrams and parts, where the PE order for parts is determined by the order predictions from (b). This is followed by a transformer decoder to enable multimodal feature fusion and interaction, along with a pose prediction head to determine the rotation $R_i$ and translation $t_i$ of each part. The predicted poses are then applied to the corresponding parts to obtain the final assembled shape $\mathbf{S}$.
  • Figure 3: Results for models trained on chair and tested on table and storage. Solid bars: cross-category; lighter bars: same-category. Arrows show the performance drop, with percentages above.
  • Figure 4: Qualitative comparison of various 3D part assembly methods. Four examples are shown: chair (a) and table (b) from the PartNet dataset, and chair (c) and table (d) from the IKEA-Manual dataset.
  • Figure 5: Visualization of the cross attention scores between step diagrams (top) and a part (left, chair's back) with PE at different positions. The final assembly results are displayed on the right.
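
The permutation and pose-application steps described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one part per step diagram, uses cosine similarity for $\mathbf{S}$ (the caption does not specify the similarity function), and substitutes random toy features for the encoder outputs. The function names `cosine_similarity_matrix`, `permutation_from_similarity`, and `apply_pose` are hypothetical; the Hungarian step uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cosine_similarity_matrix(step_feats, part_feats):
    """Return S with S[i, j] = cosine similarity of step i and part j.

    Assumption: cosine similarity stands in for the learned similarity.
    """
    a = step_feats / np.linalg.norm(step_feats, axis=1, keepdims=True)
    b = part_feats / np.linalg.norm(part_feats, axis=1, keepdims=True)
    return a @ b.T


def permutation_from_similarity(S):
    """Hungarian matching that maximizes the total similarity.

    Returns (order, P): order[i] is the part index assigned to step i,
    and P is the 0/1 permutation matrix with P[i, order[i]] = 1.
    """
    rows, cols = linear_sum_assignment(S, maximize=True)
    P = np.zeros(S.shape, dtype=int)
    P[rows, cols] = 1
    return cols, P


def apply_pose(points, R, t):
    """Rigidly transform an (N, 3) point cloud: x -> R x + t."""
    return points @ R.T + t


# Toy data (hypothetical): 4 parts with 8-dimensional features.
rng = np.random.default_rng(0)
part_feats = rng.normal(size=(4, 8))
true_order = [2, 0, 3, 1]  # part shown in each step diagram
step_feats = part_feats[true_order] + 0.01 * rng.normal(size=(4, 8))

S = cosine_similarity_matrix(step_feats, part_feats)
order, P = permutation_from_similarity(S)

# Apply a predicted pose (here a hand-picked 90-degree rotation about z).
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.0, 0.0, 1.0])
moved = apply_pose(np.array([[1.0, 0.0, 0.0]]), R, t)
```

In this sketch, `order` would supply the positions at which positional encodings are attached to the part features in stage (c), and `apply_pose` corresponds to placing each part with its predicted $R_i$ and $t_i$.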