Mind to Hand: Purposeful Robotic Control via Embodied Reasoning

Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo, Haotian Yang, Weiqi Jin, Jianan Wang

TL;DR

Lumo-1 tackles the challenge of grounding embodied reasoning in robotic control to enable generalist, instruction-following manipulation across diverse objects and environments. It introduces a three-stage training pipeline that progressively grounds language understanding, cross-embodiment action learning, and task-centric reasoning, complemented by a spatial action tokenizer and a continuous action expert via flow matching. The model produces reasoning traces and integrates reinforcement learning to align high-level reasoning with low-level motion, achieving strong generalization in long-horizon and dexterous tasks on a complex mobile manipulator. Empirical results across VLM benchmarks, generalizable pick-and-place, and fine-tuning tasks demonstrate improved embodied reasoning, stable cross-embodiment transfer, and data-efficient scaling laws guiding future data collection. Overall, Lumo-1 advances generalist robotic policies with interpretable reasoning and robust real-world performance.

Abstract

Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.

Paper Structure

This paper contains 56 sections, 13 equations, 22 figures, 5 tables.

Figures (22)

  • Figure 1: Model Architecture Illustration. Lumo-1 supports next-token prediction for both vision-language and action data, as well as flow-matching for modeling continuous actions.
  • Figure 2: Illustration of Spatial Action Tokenizer. (a) Robot trajectories are decomposed into the shortest subsequence of states (waypoints) within an acceptable reconstruction error budget using AWE (Shi et al., 2023). (b) The motion token library is constructed by clustering delta actions from a large-scale, diverse dataset, with rotation and translation processed independently. During training, at each timestep, one of the top-3 closest tokens is randomly selected from the motion token library to approximate the next waypoint; the selected token then serves as the reference for determining the subsequent token. (c) Probability densities of delta actions derived from a diverse robot trajectory dataset, projected onto 2D planes.
  • Figure 3: Overview of Curated Vision-Language Data. The curated dataset is designed to enhance core embodied reasoning abilities while preserving the general multi-modal understanding and reasoning capabilities of the pre-trained VLM.
  • Figure 4: Distribution of Data Mixture. (Left) We curate a VLM dataset comprising roughly 16.3M samples that extend general multi-modal understanding with an emphasis on spatial perception, spatial reasoning, embodied planning, and robot trajectory prediction. During Stage 1 continued VLM pre-training, we further prioritize spatial understanding as it forms the foundation of embodied reasoning. (Right) Stage 2 co-trains on diverse cross-embodiment bimanual trajectories from Genie-1, Astribot S1 prototype, and bimanual ARX/YAM/Agile X, along with VLM data down-sampled to contribute $5.84\%$ of total training tokens.
  • Figure 5: Sample Tasks Collected on Astribot S1. The tasks encompass a wide range of everyday activities, collected across diverse objects, lighting conditions, and environments.
  • ...and 17 more figures
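
The token-selection step described in the Figure 2(b) caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `nearest_tokens` and `tokenize_waypoints` are hypothetical, the library is taken as a given list of translation deltas (the clustering step and the separate rotation library are omitted), distance is assumed to be Euclidean, and "serves as the reference" is read as advancing the reference position by the selected delta.

```python
import math
import random


def nearest_tokens(library, residual, top_k=3):
    """Return indices of the top_k library deltas closest to `residual`.

    Euclidean distance is an assumption; the caption only says "closest".
    """
    order = sorted(range(len(library)),
                   key=lambda i: math.dist(library[i], residual))
    return order[:top_k]


def tokenize_waypoints(waypoints, library, top_k=3, rng=None):
    """Approximate each next waypoint with one motion token.

    At each step, one of the top_k closest tokens is drawn at random, and
    the reference position is advanced by the chosen delta (our reading of
    the caption), so later tokens are chosen relative to the approximated
    state rather than the ground-truth waypoint.
    """
    rng = rng if rng is not None else random.Random()
    reference = list(waypoints[0])
    tokens = []
    for target in waypoints[1:]:
        # Displacement still needed to reach the next waypoint.
        residual = [t - r for t, r in zip(target, reference)]
        choice = rng.choice(nearest_tokens(library, residual, top_k))
        tokens.append(choice)
        reference = [r + d for r, d in zip(reference, library[choice])]
    return tokens, reference


# Hypothetical toy library: +/- 5 cm steps along each axis.
library = [(0.05, 0.0, 0.0), (-0.05, 0.0, 0.0),
           (0.0, 0.05, 0.0), (0.0, -0.05, 0.0),
           (0.0, 0.0, 0.05), (0.0, 0.0, -0.05)]
waypoints = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.05, 0.05, 0.0)]
tokens, reached = tokenize_waypoints(waypoints, library, top_k=3,
                                     rng=random.Random(0))
```

Sampling among the top-3 candidates rather than always taking the nearest one exposes the model to slightly imperfect references during training; setting `top_k=1` recovers a deterministic greedy tokenization.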