Learning to Build by Building Your Own Instructions

Aaron Walsman, Muru Zhang, Adam Fishman, Ali Farhadi, Dieter Fox

TL;DR

We develop a new technique for the recently proposed Break-and-Make problem in LTRON, in which an agent must learn to build a previously unseen LEGO assembly using a single interactive session to gather information about its components and their structure.

Abstract

Structural understanding of complex visual objects is an important unsolved component of artificial intelligence. To study this, we develop a new technique for the recently proposed Break-and-Make problem in LTRON, in which an agent must learn to build a previously unseen LEGO assembly using a single interactive session to gather information about its components and their structure. We attack this problem by building an agent, which we call InstructioNet, that is able to make its own visual instruction book. By disassembling an unseen assembly and periodically saving images of it, the agent creates a set of instructions that gives it the information necessary to rebuild the assembly. These instructions form an explicit memory that allows the model to reason about the assembly process one step at a time, avoiding the need for long-term implicit memory. This in turn allows us to train on much larger LEGO assemblies than has been possible in the past. To demonstrate the power of this model, we release a new dataset of procedurally built LEGO vehicles that contain an average of 31 bricks each and require over one hundred steps to disassemble and reassemble. We train these models using online imitation learning, which allows the model to learn from its own mistakes. Finally, we also provide some small improvements to LTRON and the Break-and-Make problem that simplify the learning environment and improve usability.
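
The instruction-book idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `InstructionStack` class, the `break_phase` and `make_phase` functions, and the use of tuples of brick names in place of rendered images are all hypothetical simplifications. In particular, the toy saves a page before every removal, whereas the agent described above only saves images periodically.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Page = Tuple[str, ...]  # hypothetical stand-in for a saved image of the assembly


@dataclass
class InstructionStack:
    """Explicit memory: pages saved while breaking, read back in reverse while making."""
    pages: List[Page] = field(default_factory=list)

    def push(self, page: Page) -> None:
        self.pages.append(page)

    def top(self) -> Optional[Page]:
        return self.pages[-1] if self.pages else None

    def pop(self) -> Page:
        return self.pages.pop()


def break_phase(assembly: List[str]) -> InstructionStack:
    """Toy disassembly: save a page, then remove the most recently attached brick."""
    stack = InstructionStack()
    current = list(assembly)
    while current:
        stack.push(tuple(current))  # record the state before removing a brick
        current.pop()
    return stack


def make_phase(stack: InstructionStack) -> List[str]:
    """Toy reassembly: only ever compare the current state with the top page."""
    current: List[str] = []
    while stack.top() is not None:
        target = stack.top()
        if tuple(current) == target:
            stack.pop()  # this page is complete, move on to the next one
        else:
            current.append(target[len(current)])  # add the next brick the page shows
    return current


vehicle = ["chassis", "wheels", "cab"]
book = break_phase(vehicle)
rebuilt = make_phase(book)
```

Because the last page saved shows the smallest partial assembly, reading the stack from the top replays construction in the right order, and each decision depends only on the current state and one page rather than on a long history.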

Paper Structure (26 sections, 2 equations, 5 figures, 5 tables, 2 algorithms)

Figures (5)

  • Figure 1: An example of InstructioNet completing the Break-and-Make task on a previously unseen example from RC-Vehicles. Our model saves 34 distinct images to the instruction stack over the first 69 steps. It then successfully rebuilds the model from scratch using these images over the course of another 135 steps.
  • Figure 2: Our modified LTRON action space without the extra Hand viewport. We only show the manipulation actions here and do not show the camera rotation and done actions which are unchanged.
  • Figure 3: Examples of the RC-Vehicles dataset.
  • Figure 4: Architecture of the InstructioNet model. The current image from the environment and the top image of the instruction stack are tokenized and provided as input to a vision transformer encoder, along with a single readout token and another discrete token that indicates whether the current phase is Break or Make. The readout token's feature is decoded by a series of discrete action and parameter heads that determine the high-level action mode (Rotate/Translate/Pick/Assemble/Disassemble) as well as action parameters such as the rotation angle or the translation distance and direction. The cursor click and release locations are sampled from an attention map comparing features from a DPT decoder.
  • Figure 5: Examples of InstructioNet reconstructions trained on the RC-Vehicles dataset. The top right overlay shows the target assembly. These examples were chosen to present a diverse array of failure and success cases. See Section \ref{sec:qualatative} for descriptions of these failures and Section \ref{sec:evaluation} for an explanation of the evaluation metrics.
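
The Figure 4 caption says cursor locations are sampled from an attention map over decoder features. One plausible reading, sketched below, is a softmax over dot products between a query vector and a per-pixel feature grid, followed by sampling a pixel from the resulting distribution. This is an assumption for illustration only: the function names, the single query vector, and the tiny hand-written feature grid are hypothetical and do not come from the paper.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def attention_map(query: Vector, pixel_features: List[List[Vector]]) -> List[List[float]]:
    """Softmax over the dot product of the query with every pixel's feature vector."""
    scores = [
        [sum(q * f for q, f in zip(query, feat)) for feat in row]
        for row in pixel_features
    ]
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def sample_location(probs: List[List[float]], rng: random.Random) -> Tuple[int, int]:
    """Draw a (row, col) pixel location with probability given by the attention map."""
    width = len(probs[0])
    flat = [p for row in probs for p in row]
    index = rng.choices(range(len(flat)), weights=flat, k=1)[0]
    return divmod(index, width)


# A 2x2 feature grid in which only the bottom-right pixel matches the query.
features = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [10.0, 0.0]],
]
probs = attention_map([1.0, 0.0], features)
click = sample_location(probs, random.Random(0))
```

In this toy grid nearly all probability mass falls on the bottom-right pixel, so sampled clicks land there almost every time. Sampling rather than taking the argmax keeps the choice stochastic, which matches the caption's wording.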