LGMCTS: Language-Guided Monte-Carlo Tree Search for Executable Semantic Object Rearrangement

Haonan Chang, Kai Gao, Kowndinya Boyalakuntla, Alex Lee, Baichuan Huang, Harish Udhaya Kumar, Jinjin Yu, Abdeslam Boularias

TL;DR

LGMCTS is a framework that combines language guidance with geometrically informed sampling distributions to rearrange objects according to geometric patterns dictated by natural language descriptions.

Abstract

We introduce a novel approach to the executable semantic object rearrangement problem. In this challenge, a robot seeks to create an actionable plan that rearranges objects within a scene according to a pattern dictated by a natural language description. Unlike existing methods such as StructFormer and StructDiffusion, which tackle the issue in two steps by first generating poses and then leveraging a task planner for action plan formulation, our method concurrently addresses pose generation and action planning. We achieve this integration using a Language-Guided Monte-Carlo Tree Search (LGMCTS). Quantitative evaluations are provided on two simulation datasets, and complemented by qualitative tests with a real robot.

Paper Structure (17 sections, 1 theorem, 3 equations, 6 figures, 2 tables, 1 algorithm)

This paper contains 17 sections, 1 theorem, 3 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Proposition IV.1

MCTS-Planner is probabilistically complete.
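
For reference, probabilistic completeness is usually understood in the sense common to sampling-based planning: if a feasible plan exists, the probability that the planner finds one approaches 1 as the search budget grows. The proof is not reproduced here, so the statement below is only the standard form of the property, not the paper's exact formulation. The symbol $n$ (number of search iterations) is introduced here for illustration.

$$\lim_{n \to \infty} \Pr\left[\text{the planner returns a feasible plan within } n \text{ iterations}\right] = 1 \quad \text{whenever a feasible plan exists.}$$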

Figures (6)

  • Figure 1: Robotic Setup: a UR5e robot equipped with a RealSense D455 camera. The task is to re-arrange the objects, which are unknown to the robot, according to a natural language instruction.
  • Figure 2: An example of language parsing. We use GPT-4 (Brown et al., 2020) in this work.
  • Figure 3: Visualization of the $(x,y)$ prior for the 'line' pattern. From left to right: $K=0$, $K=1$, $K=2$, $K=3$, where $K=|O_{R}^{sampled}|$ is the number of sampled object poses. White star marks are sampled poses. When $K=0$, the pose can be sampled anywhere. When $K=1$, it needs to be sampled outside a circular region around the first pose. After that, all poses are sampled along the line defined by the first two poses.
  • Figure 4: A minimal example of our MCTS-Planner arranging a table. The language description provided is: "Can you please put the apple behind the spoon? And I also want the cup at the right of the apple." The top row displays the current scene arrangement, while the bottom row shows $f_{prior}$ and $f_{free}$ for the object being manipulated; the two are combined as $f=f_{prior} \times f_{free}$. In the spatial distribution figures, black represents probability 0 and white probability 1.
  • Figure 5: Real world demonstration with a UR5e robot. The language instructions for the five scenes are: (a) "Move all blocks into a circle; while put the white bottle behind one block;" (b) "Put all boxes into a rectangle; and move the white bottle to the right of one box;" (c) "Move bottles into a line; and formulate all phones into another line;" (d) "Formulate all yellow objects into a line;" (e) "Set all phones into a line;". The top row images show the initial scenes and the bottom ones show the results of using LGMCTS on the UR5e. Dotted lines imply a shape pattern and red arrows indicate a spatial pattern (left, right, front, back). These real robot experiments show that LGMCTS can parse complex language instructions and also deal with infeasible start configurations as well as pattern composition.
  • ...and 1 more figure
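
The captions for Figures 3 and 4 describe a 'line' prior that depends on how many poses have already been sampled, and a combined distribution $f=f_{prior} \times f_{free}$. The following is a minimal sketch of that idea on a discrete grid, not the authors' implementation. The function names `line_prior` and `sample_pose`, the hard 0/1 thresholds, and the parameters `min_dist` and `line_tol` are all assumptions made for illustration.

```python
import numpy as np


def line_prior(shape, sampled, min_dist=5.0, line_tol=1.5):
    """Hypothetical (x, y) prior for a 'line' pattern on an (H, W) grid.

    `sampled` holds the (x, y) poses already chosen, so K = len(sampled).
    `min_dist` and `line_tol` are illustrative thresholds, not from the paper.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys], axis=-1).astype(float)  # (H, W, 2), stored as (x, y)

    k = len(sampled)
    if k == 0:
        # K = 0: the first pose may be sampled anywhere.
        return np.ones(shape)

    p0 = np.asarray(sampled[0], dtype=float)
    if k == 1:
        # K = 1: exclude a circular region around the first pose.
        dist = np.linalg.norm(pts - p0, axis=-1)
        return (dist >= min_dist).astype(float)

    # K >= 2: keep only cells close to the line through the first two poses.
    p1 = np.asarray(sampled[1], dtype=float)
    direction = (p1 - p0) / np.linalg.norm(p1 - p0)
    rel = pts - p0
    # Perpendicular distance to the line via the 2D cross product.
    perp = np.abs(rel[..., 0] * direction[1] - rel[..., 1] * direction[0])
    return (perp <= line_tol).astype(float)


def sample_pose(f_prior, f_free, rng):
    """Sample an (x, y) cell from f = f_prior * f_free, or None if f is all zero."""
    f = f_prior * f_free
    total = f.sum()
    if total == 0:
        return None  # no cell satisfies both the pattern and the free-space mask
    idx = rng.choice(f.size, p=(f / total).ravel())
    y, x = np.unravel_index(idx, f.shape)
    return int(x), int(y)
```

Here `f_free` would be a 0/1 mask of collision-free cells. When `sample_pose` returns `None`, the pattern cannot be satisfied in the current scene, and a planner would have to handle that case, for example by first moving an obstructing object.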

Theorems & Definitions (2)

  • Proposition IV.1
  • proof