Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images

Hongyu Yan, Yadong Mu

TL;DR

The paper tackles image-guided robotic assembly by translating multi-view images of a target 3D brick model into a detailed, executable assembly plan. It introduces Neural Assembler, an end-to-end framework that learns a component graph, per-part 3D poses, and a feasible assembly order by fusing multi-view features, predicting a relation graph with a GCN, and estimating poses from multi-view cues. The approach is validated on two synthetic datasets, CLEVR-Assembly and LEGO-Assembly, showing superior performance over baselines and demonstrating transferable accuracy in real-world robotic experiments. This work advances vision-guided autonomous assembly by enabling fine-grained instruction generation from images, with potential impact on robotic construction tasks under occlusion and view variation.

Abstract

Image-guided object assembly represents a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Fed with multi-view images of the target 3D model for replication, the model designed for this task must address several sub-tasks, including recognizing individual components used in constructing the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order adhering to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model known as the Neural Assembler. This model learns an object graph where each vertex represents recognized components from the images, and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of Neural Assembler.
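
The abstract says the edges of the object graph specify the topology of the 3D model, which is what lets an assembly plan be derived. The sketch below illustrates one way such a derivation could work. It is not the authors' procedure, which this summary does not describe. It assumes, hypothetically, that each directed edge `(lower, upper)` means brick `lower` must be placed before brick `upper`, and it uses a plain topological sort from Python's standard library. The names `assembly_order` and `is_feasible` are illustrative only.

```python
from graphlib import TopologicalSorter


def assembly_order(num_bricks, edges):
    """Return one feasible placement order for bricks 0..num_bricks-1.

    `edges` holds (lower, upper) pairs. Assumed convention (not from the
    paper): `lower` must already be in place before `upper` is attached.
    Raises graphlib.CycleError if the constraints are contradictory.
    """
    # Start every brick with no prerequisites so isolated bricks are kept.
    sorter = TopologicalSorter({brick: set() for brick in range(num_bricks)})
    for lower, upper in edges:
        # add(node, *predecessors): `upper` depends on `lower`.
        sorter.add(upper, lower)
    return list(sorter.static_order())


def is_feasible(order, edges):
    """Check that every `lower` brick appears before its `upper` brick."""
    position = {brick: index for index, brick in enumerate(order)}
    return all(position[lower] < position[upper] for lower, upper in edges)


if __name__ == "__main__":
    # Toy model: brick 0 is the base, bricks 1 and 2 sit on it,
    # and brick 3 rests on both 1 and 2.
    toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
    order = assembly_order(4, toy_edges)
    print(order, is_feasible(order, toy_edges))
```

A topological sort returns only one of possibly many valid orders. A real system would also need to account for physical constraints such as stability and gripper reachability, which this sketch ignores.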

Paper Structure (16 sections, 5 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Schematic illustration of the proposed Neural Assembler. See Section 3 for more details.
  • Figure 2: The proposed Neural Assembler architecture. An image encoder outputs the visual embeddings of the multi-view images. The shape and texture libraries are provided as visual prompts for object detection. A transformer decoder module is then applied to obtain the library-based object features. Finally, the object-conditioned image features are decoded into the bricks' masks, keypoints, and rotation angles, while the global object features are decoded into the bricks' textures, shapes, the number of blocks, and the assembly graph.
  • Figure 3: Illustration of the 3D position prediction module.
  • Figure 4: The probability distribution of CCA.
  • Figure 5: Result from CLEVR-Assembly Dataset.
  • ...and 4 more figures