MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots

Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, Hao Tang

TL;DR

MobileVLA-R1 tackles the challenge of grounding natural-language instructions into continuous quadruped control by introducing a hierarchical vision-language-action model that reasons via Chain-of-Thought (CoT) before acting. It trains in two stages, supervised CoT alignment on MobileVLA-CoT followed by GRPO-based reinforcement learning, to improve reasoning consistency and control stability. The paper contributes MobileVLA-R1, the MobileVLA-CoT dataset, a CoT data engine, and a GRPO-based training protocol, achieving roughly 5% higher success rate (SR) on VLN-CE benchmarks and a robust real-world demonstration on a Unitree Go2. This work advances interpretable, generalizable embodied agents by tightly coupling explicit reasoning with continuous actuation, enabling more reliable long-horizon navigation and manipulation in real-world scenarios.

Abstract

Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action (VLA) research. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision-language-action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) for embodied trajectories, providing structured reasoning supervision for alignment. Built upon this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with approximately a 5% improvement. Real-world deployment on a quadruped robot validates robust performance in complex environments. Code: https://github.com/AIGeeksGroup/MobileVLA-R1. Website: https://aigeeksgroup.github.io/MobileVLA-R1.

Paper Structure

This paper contains 20 sections, 6 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Real-world demonstration of MobileVLA-R1. Upon receiving natural-language instructions, MobileVLA-R1 processes RGB video streams through a vision–language model to perform spatial reasoning and generate continuous locomotion commands, enabling the quadruped robot to accomplish complex tasks in real-world environments.
  • Figure 2: Architecture of MobileVLA-R1. MobileVLA-R1 is an end-to-end framework that integrates natural-language instructions with multimodal perception. It processes RGB, depth, and point cloud observations together with textual commands to generate continuous locomotion actions, enabling mobile robots to follow complex instructions and adapt to diverse environments in real time.
  • Figure 3: CoT Data Engine. We construct MobileVLA-CoT by defining navigation and step-level instructions, integrating RGB–Depth visual inputs, and specifying structured reasoning prompts. These inputs are fed into Gemini-2.5-Flash, which generates multi-granularity Chain-of-Thought (CoT) annotations with corresponding action outputs.
  • Figure 4: The RL policy pipeline. The model generates $N$ responses for a given input, and a reward is computed for each response. After normalization and clipping, the rewards are combined with a KL-divergence term, which keeps the model from over-updating, and the result is used to update the policy.
  • Figure 5: (a) Hardware platform: the Unitree Go2 quadruped robot is equipped with a Jetson Orin Nano (on-board PC) as the computation module, an L2 LiDAR for 3D environment perception, and an Intel RealSense D435i RGB-D camera for visual sensing. (b) Deployment process: RGB–Depth and 3D point cloud data are transmitted to MobileVLA-R1, which performs multimodal reasoning and action generation. The resulting velocity and motion commands are sent back to the on-board PC for real-time execution on the robot. (c) Real-world qualitative results: MobileVLA-R1 effectively integrates RGB, depth, and map observations to follow long-horizon language instructions with coherent spatial reasoning.
  • ...and 5 more figures
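
The update described in the Figure 4 caption (sample $N$ responses, score each, normalize within the group, clip, and add a KL penalty) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the commonly used GRPO formulation with one sequence-level log-probability per response, a PPO-style clipped probability ratio, and a non-negative KL estimate against a frozen reference policy. The function names, `clip_eps`, and `kl_coef` are hypothetical, and the paper's actual reward terms and hyperparameters are not given in this summary.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of N sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_objective(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, kl_coef=0.04):
    """Average clipped surrogate minus a KL penalty (to be maximized).

    logp_new / logp_old / logp_ref hold one log-probability per response
    under the current, sampling, and reference policies (an assumption:
    real implementations usually work per token).
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (PPO-style) choice between raw and clipped terms.
        surrogate = min(ratio * adv, clipped * adv)
        # Non-negative estimate of KL(current || reference).
        log_ref_ratio = lr - ln
        kl = math.exp(log_ref_ratio) - log_ref_ratio - 1.0
        total += surrogate - kl_coef * kl
    return total / len(rewards)


if __name__ == "__main__":
    # Four sampled responses; two succeed (reward 1) and two fail (reward 0).
    rewards = [1.0, 0.0, 1.0, 0.0]
    logp_old = [-1.0, -1.0, -1.0, -1.0]
    logp_new = [-0.9, -1.0, -1.0, -1.0]  # first response became more likely
    print(grpo_objective(logp_new, logp_old, logp_old, rewards))
```

Because advantages are computed relative to the group mean, no separate value network is needed. The KL term is what the caption refers to as preventing the model from over-updating.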