Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation
Xiangtong Yao, Hongkuan Zhou, Oier Mees, Yuan Meng, Ted Xiao, Yonatan Bisk, Jean Oh, Edward Johns, Mohit Shridhar, Dhruv Shah, Jesse Thomason, Kai Huang, Joyce Chai, Zhenshan Bing, Alois Knoll
TL;DR
The paper surveys language-conditioned robot manipulation through a functional taxonomy: language for state evaluation, language as a policy condition, and language for cognitive planning and reasoning. It covers reinforcement learning (RL), imitation learning (IL), diffusion-based policies, and neuro-symbolic approaches, and highlights foundation models (LLMs, VLMs) and vision-language-action models (VLAs) as central enablers, while also addressing data strategies, computation costs, and real-world deployment challenges. Key contributions include a cross-sectional analysis across action granularity, supervision regimes, and evaluation environments, plus a discussion of open problems in generalization, safety, and real-time performance. The survey argues for structured, hybrid, and data-efficient approaches, including cross-embodiment alignment and lifelong learning, to advance robust, scalable, language-grounded robotic manipulation in unstructured real-world settings.
Abstract
Language-conditioned robot manipulation is an emerging field aimed at enabling seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robot actions. In this comprehensive survey, we systematically explore recent advancements in language-conditioned robot manipulation. We categorize existing methods by the primary way language is integrated into the robot system: language for state evaluation, language as a policy condition, and language for cognitive planning and reasoning. We further analyze state-of-the-art techniques along four axes: action granularity, data and supervision regimes, system cost and latency, and environments and evaluations. Additionally, we highlight the key debates in the field. Finally, we discuss open challenges and future research directions, focusing on improving generalization and addressing safety issues in language-conditioned robot manipulators.
