MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots

Ting Huang; Dongjian Li; Rui Yang; Zeyu Zhang; Zida Yang; Hao Tang

TL;DR

MobileVLA-R1 tackles the challenge of grounding natural-language instructions into continuous quadruped control by introducing a hierarchical vision-language-action model that reasons via Chain-of-Thought (CoT) before acting. It trains in two stages, supervised CoT alignment on MobileVLA-CoT followed by GRPO-based reinforcement learning, to improve reasoning consistency and control stability. The paper contributes MobileVLA-R1, the MobileVLA-CoT data ecosystem, a CoT data engine, and a GRPO-based training protocol, achieving a roughly 5% higher success rate (SR) on VLN-CE benchmarks and a robust real-world demonstration on a Unitree Go2. This work advances interpretable, generalizable embodied agents by tightly coupling explicit reasoning with continuous actuation, enabling more reliable long-horizon navigation and manipulation in real-world scenarios.

Abstract

Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action (VLA) research. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision-language-action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) for embodied trajectories, providing structured reasoning supervision for alignment. Built upon this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with an improvement of approximately 5%. Real-world deployment on a quadruped robot validates robust performance in complex environments. Code: https://github.com/AIGeeksGroup/MobileVLA-R1. Website: https://aigeeksgroup.github.io/MobileVLA-R1.
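
The GRPO stage named in the abstract builds on group-relative advantages: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The snippet below is a minimal sketch of that normalization only, not the paper's training code. The function name, the reward values, the `eps` constant, and the use of the population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the statistics of its own sample group.

    `rewards` holds one scalar reward per response sampled for the same
    instruction. `eps` (an assumed constant) guards against division by
    zero when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled for one instruction.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate learned value function is needed to form the baseline.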
