Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

Xiangtong Yao, Hongkuan Zhou, Oier Mees, Yuan Meng, Ted Xiao, Yonatan Bisk, Jean Oh, Edward Johns, Mohit Shridhar, Dhruv Shah, Jesse Thomason, Kai Huang, Joyce Chai, Zhenshan Bing, Alois Knoll

TL;DR

The paper surveys language-conditioned robot manipulation through a functional taxonomy: language for state evaluation, language as a policy condition, and language for cognitive planning and reasoning. It covers reinforcement learning (RL), imitation learning (IL), diffusion-based policies, and neuro-symbolic approaches, highlighting the rise of foundation models (LLMs, VLMs) and vision-language-action models (VLAs) as central enablers, while also addressing data strategies, computation costs, and real-world deployment challenges. Key contributions include a cross-sectional analysis across action granularity, supervision regimes, and evaluation environments, plus a discussion of open problems in generalization, safety, and real-time performance. The survey argues for structured, hybrid, and data-efficient approaches, including cross-embodiment alignment and lifelong learning, to advance robust, scalable, language-grounded robotic manipulation in unstructured real-world settings.

Abstract

Language-conditioned robot manipulation is an emerging field aimed at enabling seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robot actions. In this comprehensive survey, we systematically explore recent advancements in language-conditioned robot manipulation. We categorize existing methods based on the primary ways language is integrated into the robot system, namely language for state evaluation, language as a policy condition, and language for cognitive planning and reasoning. We further analyze state-of-the-art techniques along four axes: action granularity, data and supervision regimes, system cost and latency, and environments and evaluations. Additionally, we highlight the key debates in the field. Finally, we discuss open challenges and future research directions, focusing on enhancing generalization capabilities and addressing safety issues in language-conditioned robot manipulators.

Paper Structure (72 sections, 15 equations, 17 figures, 7 tables)

This paper contains 72 sections, 15 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Language-conditioned manipulation sits at the intersection of computer vision, natural language processing, and robotics. Scene understanding, language understanding/grounding, policy learning/design, and action execution are widely studied in this realm. This field leverages a variety of techniques, such as Vision-Language Models, Large Language Models, Vision-Language-Action Models, Imitation Learning, Reinforcement Learning, or Planning, to achieve behavior.
  • Figure 2: This architectural framework provides a high-level overview of language-conditioned robot manipulation. The agent comprises three key modules: the language module, the perception module, and the control module. These modules serve the functions of understanding instructions, perceiving the environment's state, and acquiring skills, respectively. The vision-language module establishes a connection between instructions and the surrounding environment to achieve a more profound comprehension of both aspects. This interplay of information from both modalities enables the robot to engage in high-level planning and perform visual question answering tasks, ultimately enhancing its overall performance. The control module can acquire low-level policies by learning from expert-engineered rewards (reinforcement learning) and demonstrations (imitation learning). At times, these low-level policies can also be directly designed or hard-coded, making use of path and motion planning algorithms. There are two key loops to highlight. The interactive loop, located on the left, facilitates human-robot language interaction. The control loop, positioned on the right, signifies the interaction between the agent and its surrounding environment.
  • Figure 3: Overview list of representative language-conditioned robotic manipulation methods.
  • Figure 4: An illustration of different reward schemes. (a) Dense reward: The agent receives a gradually increasing reward as it approaches the goal, providing continuous guidance. (b) Sparse reward: The agent only receives a large reward upon reaching the final goal state. (c) Reward function learning: A function is learned to map state-goal pairs to a continuous reward value, creating a smooth reward gradient.
  • Figure 5: Taxonomy of the section on language for state evaluation.
  • ...and 12 more figures
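
The three reward schemes described in the Figure 4 caption can be illustrated with a toy goal-reaching example. The sketch below is not taken from the survey: the function names, the `tolerance` and `bonus` values, and the choice of a linear least-squares model as the "learned" reward are all assumptions made purely for illustration. Real systems typically learn the reward from images and language with a neural network.

```python
import math


def distance(state, goal):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(state, goal)


def dense_reward(state, goal):
    """(a) Dense: the reward increases continuously as the agent nears the goal."""
    return -distance(state, goal)


def sparse_reward(state, goal, tolerance=0.05, bonus=1.0):
    """(b) Sparse: a reward is given only once the goal is (almost) reached.

    `tolerance` and `bonus` are hypothetical values chosen for this sketch.
    """
    return bonus if distance(state, goal) <= tolerance else 0.0


def fit_reward(samples):
    """(c) Reward function learning, reduced to a one-feature linear fit.

    `samples` is a list of (state, goal, label) triples, where `label` is a
    hypothetical progress score supplied by an annotator. The model
    r = a * distance + b is fitted by ordinary least squares and returned
    as a function of (state, goal).
    """
    xs = [distance(s, g) for s, g, _ in samples]
    ys = [label for _, _, label in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return lambda state, goal: a * distance(state, goal) + b


goal = (1.0, 0.0)
path = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]

# Hypothetical annotated progress scores for the three states on the path.
labelled = [(s, goal, score) for s, score in zip(path, [0.0, 0.5, 1.0])]
learned_reward = fit_reward(labelled)

for state in path:
    print(state, dense_reward(state, goal), sparse_reward(state, goal),
          learned_reward(state, goal))
```

The dense and learned rewards both give the agent a signal at every state on the path, whereas the sparse reward is silent until the final state. That is why sparse rewards are easy to specify but harder to learn from.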