Mind to Hand: Purposeful Robotic Control via Embodied Reasoning

Peijun Tang; Shangjin Xie; Binyan Sun; Baifu Huang; Kuncheng Luo; Haotian Yang; Weiqi Jin; Jianan Wang

TL;DR

Lumo-1 tackles the challenge of grounding embodied reasoning in robotic control to enable generalist, instruction-following manipulation across diverse objects and environments. It introduces a three-stage training pipeline that progressively grounds language understanding, cross-embodiment action learning, and task-centric reasoning, complemented by a spatial action tokenizer and a continuous action expert via flow matching. The model produces reasoning traces and integrates reinforcement learning to align high-level reasoning with low-level motion, achieving strong generalization in long-horizon and dexterous tasks on a complex mobile manipulator. Empirical results across VLM benchmarks, generalizable pick-and-place, and fine-tuning tasks demonstrate improved embodied reasoning, stable cross-embodiment transfer, and data-efficient scaling laws guiding future data collection. Overall, Lumo-1 advances generalist robotic policies with interpretable reasoning and robust real-world performance.

Abstract

Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) co-training on cross-embodiment robot data alongside vision-language data; and (3) action training with an explicit reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments. It excels particularly in long-horizon tasks and in responding to natural human instructions that require reasoning over strategy, concepts, and space.
