Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs

Yusuke Mikami, Andrew Melnik, Jun Miura, Ville Hautamäki

TL;DR

This paper proposes a semantic, natural language reasoning framework for robotics task planning that directly outputs coordinate-level actions, avoiding reliance on predefined APIs or code-as-policy approaches. By describing objects and tasks in natural language and employing Chain-of-Thought reasoning, the method generates actionable coordinates from multimodal prompts, then maps front-view coordinates to top-view for execution. Ablation studies demonstrate that explicit NL reasoning substantially boosts success rates, especially for novel tasks, while still facing challenges in precise rotations and complex actions. Overall, the work highlights the potential of NL-centric planning to improve generalization and skill transfer in open-world robotic scenarios, suggesting a promising direction for more flexible, language-grounded embodied agents.

Abstract

We demonstrate experimental results with LLMs that address robotics task planning problems. Recently, LLMs have been applied in robotics task planning, particularly using a code generation approach that converts complex high-level instructions into mid-level policy code. In contrast, our approach acquires text descriptions of the task and scene objects, formulates task planning through natural language reasoning, and outputs coordinate-level control commands, thus reducing the need for intermediate code-as-policies representations built on pre-defined APIs. Our approach is evaluated on a multi-modal prompt simulation benchmark, where our prompt engineering experiments show that natural language reasoning significantly improves success rates compared with prompts that omit it. Furthermore, our approach illustrates the potential for natural language descriptions to transfer robotics skills from known tasks to previously unseen tasks. The project website: https://natural-language-as-policies.github.io/
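
As a rough illustration of the pipeline the abstract describes, the sketch below assembles a one-shot prompt from a text description of the task and scene, optionally includes the demonstration's natural language reasoning (dropping it mimics the ablation), and parses a coordinate-level pick-and-place command from the model's reply. Everything here is an assumption made for illustration: the names `build_prompt`, `parse_action`, and `PickPlace`, the prompt layout, and the `pick (x, y) place (x, y)` output format are hypothetical and are not taken from the paper's actual prompt or code.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PickPlace:
    """Hypothetical coordinate-level command: pick at one (x, y), place at another."""
    pick: Tuple[float, float]
    place: Tuple[float, float]


def build_prompt(demo: dict, task: str, scene: str, use_cot: bool = True) -> str:
    """Assemble a one-shot prompt; use_cot=False omits the demo's reasoning (ablation)."""
    parts = [f"Task: {demo['task']}", f"Scene: {demo['scene']}"]
    if use_cot:
        parts.append(f"Reasoning: {demo['reasoning']}")
    parts.append(f"Action: {demo['action']}")
    # Blank line separates the demonstration from the new query.
    parts += ["", f"Task: {task}", f"Scene: {scene}"]
    # Cue the model to reason first, or to answer directly in the ablation.
    parts.append("Reasoning:" if use_cot else "Action:")
    return "\n".join(parts)


# Assumed output format (not from the paper): "pick (x, y) place (x, y)".
_NUM = r"(-?\d+(?:\.\d+)?)"
_ACTION = re.compile(
    rf"pick\s*\(\s*{_NUM}\s*,\s*{_NUM}\s*\)\s*place\s*\(\s*{_NUM}\s*,\s*{_NUM}\s*\)",
    re.IGNORECASE,
)


def parse_action(reply: str) -> Optional[PickPlace]:
    """Return the last pick/place pattern found in the reply, or None if absent."""
    matches = _ACTION.findall(reply)
    if not matches:
        return None
    x1, y1, x2, y2 = map(float, matches[-1])
    return PickPlace((x1, y1), (x2, y2))
```

Taking the last match in the reply is a deliberate choice for this sketch: with reasoning enabled, coordinates may be mentioned several times before the final command, so the final occurrence is treated as the answer.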

Paper Structure (25 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Overview of our approach. We provide one demonstration as an in-context example, and the planning step employs natural language reasoning instead of a conventional code implementation. For our ablation study, we remove the CoT reasoning component from the in-context example to check the importance of natural language reasoning. We use a low-level API (pick-and-place or sweep) to control the robot arm. We present specific examples of natural language reasoning in Table \ref{table:reasoning}.
  • Figure 2: Task planning is a mapping process from high-level human intention to low-level action commands (vertical axis). To achieve a general-purpose agent, it is important to reduce reliance on static components.
  • Figure 3: The full prompt with ellipses indicating omitted sections due to space limitations.