Learn from the Past: Language-conditioned Object Rearrangement with Large Language Models
Guanqun Cao, Ryan Mckenna, Erich Graf, John Oyekan
TL;DR
The paper tackles language-conditioned object rearrangement for robots and highlights the limited generalization of methods that rely on pre-collected datasets. It proposes a retrieval-augmented framework in which a Large Language Model references past successful rearrangements, with vision-language grounding provided by SAM and CLIP, to infer goal poses in a zero-shot manner. Key contributions include a unified pipeline for vision-grounded reasoning, retrieval-based context drawn from external experiences, and demonstrated improvements in planning and execution across single-object, multi-object, and sequential tasks. The approach reduces data demands and promises flexible, human-like reasoning for robotic manipulation, with practical potential for local deployment.
Abstract
Rearranging objects into a specific goal state is an important task for collaborative robots. Accurately determining object placement is a key challenge, as misalignment can increase task complexity and the risk of collisions, reducing the efficiency of the rearrangement process. Most current methods rely heavily on pre-collected datasets to train a model that predicts the goal position. As a result, these methods are restricted to specific instructions, which limits their broader applicability and generalisation. In this paper, we propose a flexible framework for language-conditioned object rearrangement based on a Large Language Model (LLM). Our approach mimics human reasoning by using successful past experiences as a reference to infer the best strategy for achieving the desired goal position. Leveraging the LLM's strong natural language comprehension and inference ability, our method generalises to various everyday objects and free-form language instructions in a zero-shot manner. Experimental results demonstrate that our method can effectively execute robotic rearrangement tasks, even those involving long sequences of instructions.
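To make the idea of using past experiences as a reference more concrete, the following is a minimal sketch of how stored rearrangements might be retrieved and placed into an LLM prompt. It is not the authors' implementation: the `Experience` record, the bag-of-words `similarity` function (a stand-in for whatever retrieval measure the paper uses), the coordinate convention, and the prompt wording are all assumptions made for illustration, and the call to the LLM itself is omitted.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Experience:
    """One stored past rearrangement (hypothetical record format)."""
    instruction: str
    goal_pose: tuple  # assumed (x, y) position in a table frame


def _bag_of_words(text):
    # Lower-cased word counts; a crude stand-in for a learned text embedding.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = _bag_of_words(a), _bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, memory, k=2):
    """Return the k stored experiences most similar to the query instruction."""
    ranked = sorted(
        memory, key=lambda e: similarity(query, e.instruction), reverse=True
    )
    return ranked[:k]


def build_prompt(query, scene_objects, memory, k=2):
    """Assemble a text prompt that shows retrieved experiences before the new task."""
    lines = ["Past successful rearrangements:"]
    for e in retrieve(query, memory, k):
        lines.append(f"- instruction: {e.instruction!r} -> goal pose: {e.goal_pose}")
    lines.append(f"Objects detected in the scene: {', '.join(scene_objects)}")
    lines.append(f"New instruction: {query!r}")
    lines.append("Infer the goal pose for the new instruction as (x, y).")
    return "\n".join(lines)


# Hypothetical memory of past successes.
memory = [
    Experience("put the mug to the left of the plate", (0.30, 0.45)),
    Experience("stack the red block on the blue block", (0.55, 0.20)),
    Experience("place the fork to the right of the plate", (0.50, 0.45)),
]

prompt = build_prompt(
    "put the spoon to the right of the plate", ["spoon", "plate"], memory
)
print(prompt)
```

In this toy setup the two plate-related experiences are retrieved and the unrelated stacking example is left out, so the prompt only carries context relevant to the new instruction. In a real system, the list of scene objects would come from a visual grounding stage rather than being hard-coded.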
