Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo

TL;DR

Robot-R1 addresses the limitations of SFT-based embodied reasoning in LVLMs for robotics by introducing a reinforcement-learning framework that trains LVLMs to predict the next keypoint state from scene imagery and environment metadata. It reframes continuous state prediction as discrete multiple-choice QA and optimizes reasoning with Group Relative Policy Optimization, guided by auxiliary tasks for current state and movement descriptions. A novel Robot-R1 Bench provides open-ended, grounded evaluation across planning, high-level action, movement, and spatial reasoning, with GPT-4o as the judge demonstrating strong correlation to human judgments. Experiments show that a 7B LVLM trained with Robot-R1 outperforms SFT baselines and even GPT-4o on low-level action reasoning, and transfers to external benchmarks like EmbodiedBench Manipulation and SpatialRGPT, indicating practical impact for efficient and generalizable robotic control.

Abstract

Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands diverse embodied reasoning capabilities. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
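
The "sample responses and reinforce the more accurate ones" step can be illustrated with the group-relative advantage commonly used in GRPO. The sketch below is a minimal illustration, not the authors' implementation: the function name `group_relative_advantages`, the binary correctness reward, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group.

    Assumption: rewards are scalars (here 1.0 for a correct MCQA answer,
    0.0 otherwise) for several responses sampled from the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact variant is an assumption
    # Responses scoring above the group mean get a positive advantage,
    # those below get a negative one; eps avoids division by zero.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled responses to one question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no learning signal under this formulation.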

Paper Structure

This paper contains 22 sections, 1 equation, 13 figures, 15 tables.

Figures (13)

  • Figure 1: Illustration of the Robot-R1 framework. (a) Robot-R1 uses robot states and image observations from expert demonstrations to create a dataset. (b) These data are reformulated into three different multiple-choice question answering (MCQA) tasks: predicting next states, current states, and movements. (c) During training, an LVLM solves MCQA tasks with reasoning, which is then optimized using the GRPO algorithm (Guo et al., 2025) to reinforce reasoning pathways.
  • Figure 2: Illustration of the Robot-R1 Bench. (a) Robot-R1 Bench consists of human-written questions paired with corresponding ground truth (reference) answers. (b) The LVLM under evaluation takes each question along with its associated image as input and generates an answer. (c) The generated answers are scored using GPT-4o, based on predefined rubrics and ground truth answers.
  • Figure 3: Robot-R1 Bench results. In embodied reasoning tailored for low-level control, Robot-R1 outperforms all previously reported models.
  • Figure 4: Waypoint prediction MCQA prompt template
  • Figure 5: Current state prediction MCQA prompt template
  • ...and 8 more figures
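
The reformulation described in Figure 1(b), turning a continuous state into a multiple-choice question, can be sketched as follows. This is a hypothetical illustration only: the source does not specify how distractor options are generated, so the uniform-noise perturbation, the `make_mcqa` name, the `noise` magnitude, and the three-value state are all assumptions.

```python
import random


def make_mcqa(target, num_choices=4, noise=0.1, rng=None):
    """Build a multiple-choice question around a continuous target state.

    Assumption: distractors are the target perturbed by uniform noise.
    The paper's actual distractor sampling is not given in this summary.
    Supports at most 8 choices (labels A to H).
    """
    rng = rng or random.Random(0)
    target = tuple(target)
    choices = [target]
    while len(choices) < num_choices:
        distractor = tuple(round(v + rng.uniform(-noise, noise), 3) for v in target)
        if distractor not in choices:  # keep every option distinct
            choices.append(distractor)
    rng.shuffle(choices)
    labels = "ABCDEFGH"[:num_choices]
    answer = labels[choices.index(target)]
    return dict(zip(labels, choices)), answer


# Hypothetical next-keypoint position (x, y, z) taken from a demonstration.
options, answer = make_mcqa((0.25, -0.10, 0.80))
```

With a discrete answer letter, checking a response reduces to an exact match against `answer`, which is what makes a simple correctness reward possible.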