Learn from the Past: Language-conditioned Object Rearrangement with Large Language Models

Guanqun Cao, Ryan Mckenna, Erich Graf, John Oyekan

TL;DR

The paper tackles language-conditioned object rearrangement for robots, highlighting generalization gaps in dataset-reliant methods. It proposes a retrieval-augmented framework that uses a Large Language Model to reference past successful rearrangements, guided by vision-language grounding via SAM and CLIP, enabling zero-shot goal-pose inference. Key contributions include a unified pipeline for vision-grounded reasoning, retrieval-based context from external experiences, and demonstrated improvements in planning and execution across single-object, multi-object, and sequential tasks. This approach reduces data demands and promises flexible, human-like reasoning for robotic manipulation with practical potential for local deployment.

Abstract

Object manipulation for rearrangement into a specific goal state is a significant task for collaborative robots. Accurately determining object placement is a key challenge, as misalignment can increase task complexity and the risk of collisions, reducing the efficiency of the rearrangement process. Most current methods rely heavily on pre-collected datasets to train a model that predicts the goal position. As a result, these methods are restricted to specific instructions, which limits their broader applicability and generalisation. In this paper, we propose a flexible framework for language-conditioned object rearrangement based on a Large Language Model (LLM). Our approach mimics human reasoning by using successful past experiences as a reference to infer the best strategy for achieving the desired goal position. Building on the LLM's strong natural language comprehension and inference ability, our method generalises to various everyday objects and free-form language instructions in a zero-shot manner. Experimental results demonstrate that our method can effectively execute robotic rearrangement tasks, even those involving long sequences of orders.

Paper Structure

This paper contains 14 sections, 3 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Learn from the past. In our framework, the robot retrieves past experiences to find the most similar arrangement based on human instructions. By referencing previously successful arrangements, the robot can mimic human-like reasoning (see Fig. 3), allowing it to infer the goal position for rearrangement more effectively.
  • Figure 2: Learn from the past in humans. A diagram showing how humans use past experiences to guide successful current and future task completion in lifelong learning (Barker et al., 1998). Mental models are used as templates or references for current tasks. Successful mental models are stored for future reuse.
  • Figure 3: Illustration of the proposed framework. The robot uses SAM for visual perception and CLIP for semantic understanding to identify where objects are in the environment and what they are. The LLM then associates the instruction with the most similar past experience and uses that experience as a template and reference. Finally, a prompt is created with spatial and semantic information, allowing the LLM to reason and predict the goal position for rearrangement.
  • Figure 4: Prompt engineering. A simplified prompt for the LLM to perform spatial reasoning. It includes the spatial and semantic information from both the observed RGB image and a similar successful experience.
  • Figure 5: Examples of successful rearrangements by humans. Four successful rearrangements arranged by humans, with their corresponding instructions.
  • ...and 2 more figures
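
The retrieve-then-prompt flow described in the captions for Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `DetectedObject`, `Experience`, `retrieve`, and `build_prompt` are hypothetical, the prompt wording is invented, and a simple bag-of-words cosine similarity stands in for the association step that the paper assigns to the LLM. Object labels and positions are assumed to come from an upstream perception stage (SAM and CLIP in the paper), which is not modelled here.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class DetectedObject:
    label: str   # semantic label (assumed to come from the perception stage)
    x: float     # object centre, in arbitrary workspace coordinates
    y: float


@dataclass
class Experience:
    instruction: str   # instruction of a past successful rearrangement
    objects: list      # list of DetectedObject in their final poses


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for LLM-based matching)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(instruction: str, memory: list) -> Experience:
    """Return the stored experience whose instruction is most similar."""
    return max(memory, key=lambda e: cosine(instruction, e.instruction))


def describe(objects: list) -> str:
    """Render labels and positions as plain text for the prompt."""
    return "; ".join(f"{o.label} at ({o.x:.2f}, {o.y:.2f})" for o in objects)


def build_prompt(instruction: str, scene: list, reference: Experience) -> str:
    """Combine the current scene with the retrieved experience (hypothetical wording)."""
    return (
        f"Instruction: {instruction}\n"
        f"Current scene: {describe(scene)}\n"
        f"Reference instruction: {reference.instruction}\n"
        f"Reference arrangement: {describe(reference.objects)}\n"
        "Predict the goal position of each object to be moved."
    )


memory = [
    Experience("put the fork to the left of the plate",
               [DetectedObject("fork", 0.20, 0.50), DetectedObject("plate", 0.50, 0.50)]),
    Experience("stack the blocks in a tower",
               [DetectedObject("red block", 0.40, 0.40), DetectedObject("blue block", 0.40, 0.40)]),
]
scene = [DetectedObject("fork", 0.80, 0.10), DetectedObject("plate", 0.45, 0.55)]
query = "place the fork left of the plate"

reference = retrieve(query, memory)
prompt = build_prompt(query, scene, reference)
print(prompt)
```

The resulting `prompt` string would then be sent to whichever LLM is in use. Keeping retrieval separate from prompt construction means the similarity measure can be swapped (for example, for an embedding model or an LLM query) without touching the prompt template.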