Language Models as Zero-Shot Trajectory Generators

Teyun Kwon, Norman Di Palo, Edward Johns

TL;DR

This work investigates whether a pre-trained GPT-4 can directly generate dense low-level robot trajectories using only object detection and segmentation outputs, without pre-trained skills or trajectory optimisers, via a single task-agnostic prompt. Across 30 real-world manipulation tasks, it shows that the LLM can output executable trajectories, or code that generates them, and can autonomously detect failures and re-plan. Extensive prompt ablations identify the components that most improve robustness (stepwise reasoning, function documentation, and explicit gripper control), and Code-as-Policies generally underperforms on unseen tasks in comparison. The findings push the boundary of LLM applicability in robotics by revealing emergent low-level control capabilities, though they also highlight current limits in precision and perception that future vision-language advances may address.

Abstract

Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation tasks, when given access to only object detection and segmentation vision models. We designed a single, task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers. Then we studied how well it can perform across 30 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and we investigated which design choices in this prompt are the most important. Our conclusions raise the assumed limit of LLMs for robotics, and we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly. Videos, prompts, and code are available at: https://www.robot-learning.uk/language-models-trajectory-generators.
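
To make "a dense sequence of end-effector poses" concrete, the following is a minimal sketch of the kind of Python an LLM might write for a simple top-down grasp. It is not taken from the paper's prompts or code: the `Waypoint` layout (position, yaw, gripper state), the function names, and the default hover height are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Waypoint:
    """One trajectory step. This 4-DoF layout is an illustrative assumption."""
    x: float            # metres
    y: float            # metres
    z: float            # metres
    yaw: float          # rotation about the vertical axis, radians
    gripper_open: bool


def interpolate(start: Waypoint, end: Waypoint, steps: int) -> List[Waypoint]:
    """Return steps + 1 evenly spaced waypoints from start to end, inclusive."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    dense = []
    for i in range(steps + 1):
        t = i / steps
        dense.append(Waypoint(
            x=start.x + t * (end.x - start.x),
            y=start.y + t * (end.y - start.y),
            z=start.z + t * (end.z - start.z),
            yaw=start.yaw + t * (end.yaw - start.yaw),
            gripper_open=start.gripper_open,  # gripper state held during the move
        ))
    return dense


def approach_from_above(target: Tuple[float, float, float],
                        hover: float = 0.10,
                        steps: int = 20) -> List[Waypoint]:
    """Descend vertically onto a detected object centre, then close the gripper."""
    x, y, z = target
    above = Waypoint(x, y, z + hover, 0.0, True)
    at_object = Waypoint(x, y, z, 0.0, True)
    trajectory = interpolate(above, at_object, steps)
    # Final step: same pose, gripper closed (explicit gripper control).
    trajectory.append(Waypoint(x, y, z, 0.0, False))
    return trajectory
```

In this sketch the object centre would come from the detection and segmentation models; everything after that is plain geometry, which is the kind of low-level reasoning the abstract argues an LLM can produce without motion primitives.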

Paper Structure (9 sections, 12 figures, 1 algorithm)

Figures (12)

  • Figure 1: A selection of the tasks we use to study if a single, task-agnostic LLM prompt can generate a dense sequence of end-effector poses, when given only object detection and segmentation models, and no in-context examples, motion primitives, pre-trained skills, or external trajectory optimisers.
  • Figure 2: A taxonomy of requirements of LLM-based zero-shot methods from the recent literature.
  • Figure 3: Example wrist-camera observations received by the robot at the start of each task, and their corresponding task instructions.
  • Figure 4: An overview of the pipeline. (1) The main prompt along with the task instruction is provided to the LLM, from which it (2) generates high-level natural language reasoning steps before outputting Python code (3) to interface with a pre-trained object detection model and execute the generated trajectories on the robot. (4) After task execution, an off-the-shelf object tracking model is used to obtain 3-D bounding boxes of the previously detected objects over the duration of the task, which are then provided to the LLM as numerical values to detect whether the task was executed successfully or not.
  • Figure 5: We investigate the effect of removing parts of the main prompt on task success rates.
  • ...and 7 more figures
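
The four pipeline stages described in the Figure 4 caption can be read as a plan, execute, track, and check loop. The sketch below shows one possible shape for that loop; it is an assumption-laden illustration, not the authors' implementation. The function `run_task`, its parameters, the wording of the prompts, and the retry limit are all hypothetical, and the LLM, robot executor, and object tracker are passed in as plain callables so that no real API is implied.

```python
from typing import Callable, Optional, Sequence


def run_task(instruction: str,
             main_prompt: str,
             query_llm: Callable[[str], str],
             execute_code: Callable[[str], None],
             track_objects: Callable[[], Sequence],
             max_attempts: int = 3) -> Optional[int]:
    """Return the attempt number on which the task succeeded, or None.

    All callables are hypothetical stand-ins:
      query_llm     -- sends text to the LLM and returns its reply
      execute_code  -- runs the generated Python on the robot
      track_objects -- returns 3-D bounding boxes tracked over the episode
    """
    prompt = f"{main_prompt}\n\nTask: {instruction}"
    for attempt in range(1, max_attempts + 1):
        # (1)-(3): the LLM reasons step by step and returns code to execute.
        generated_code = query_llm(prompt)
        execute_code(generated_code)

        # (4): tracked boxes are given back to the LLM as numerical values.
        boxes = track_objects()
        verdict = query_llm(
            f"Task: {instruction}\n"
            f"Tracked 3-D bounding boxes over time: {list(boxes)}\n"
            "Was the task successful? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            return attempt

        # Failure detected: ask for a new plan on the next iteration.
        prompt += "\n\nThe previous attempt failed. Re-plan the trajectory."
    return None
```

Passing the components in as callables keeps the loop testable with stubs and makes explicit that success detection here relies only on numbers from the tracker, as the caption describes.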