Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

Baining Zhao, Ziyou Wang, Jianjie Fang, Chen Gao, Fanhang Man, Jinqiang Cui, Xin Wang, Xinlei Chen, Yong Li, Wenwu Zhu

TL;DR

Embodied-R tackles embodied spatial reasoning by decoupling perception from reasoning: a large vision-language model processes continuous video frames, and a small language model trained with reinforcement learning performs slow-thinking reasoning. A key-frame extractor reduces computation, while a three-part reward system that includes a logical-consistency reward aligns the model's thinking with its answers. With modest data (5k videos) and a 3B LM, the approach reaches performance comparable to state-of-the-art multimodal reasoning models on in-distribution and out-of-distribution embodied tasks, and it shows emergent slow-thinking behaviors and robust generalization. These findings highlight the value of perception-reasoning collaboration and reward design for building embodied reasoning in resource-efficient settings.

Abstract

Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.
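To make the reward design more concrete, the following is a minimal sketch of how a three-part reward (format, accuracy, think-answer logical consistency) could be combined. It is not the authors' implementation. The `<think>`/`<answer>` tag layout, the equal weights, the exact-match accuracy check, and the rule that consistency is only credited when the answer is also correct are all assumptions made for illustration. `answer_from_thinking` is a hypothetical callable standing in for whatever procedure derives an answer from the reasoning text alone.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Assumed response layout: <think>...</think><answer>...</answer>.
# The tag names are an illustrative assumption, not taken from the paper text above.
_PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def parse_response(text: str) -> Optional[Tuple[str, str]]:
    """Return (thinking, answer) if the response follows the assumed layout."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def three_part_reward(
    response: str,
    ground_truth: str,
    answer_from_thinking: Callable[[str], str],  # hypothetical helper
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed weights
) -> Dict[str, float]:
    """Combine format, accuracy, and logical-consistency rewards."""
    parsed = parse_response(response)
    if parsed is None:
        # A malformed response earns nothing in this sketch.
        return {"format": 0.0, "accuracy": 0.0, "consistency": 0.0, "total": 0.0}

    thinking, answer = parsed
    fmt = 1.0
    acc = 1.0 if answer.lower() == ground_truth.strip().lower() else 0.0

    # Consistency: does the reasoning, taken alone, lead to the stated answer?
    # Assumption: only credited when the answer is also correct, so that a
    # consistently wrong response is not rewarded.
    implied = answer_from_thinking(thinking).strip().lower()
    cons = 1.0 if acc == 1.0 and implied == answer.lower() else 0.0

    w_fmt, w_acc, w_cons = weights
    total = w_fmt * fmt + w_acc * acc + w_cons * cons
    return {"format": fmt, "accuracy": acc, "consistency": cons, "total": total}
```

In use, a response whose answer is correct but whose reasoning points elsewhere would score on format and accuracy only, which is the kind of mismatch a consistency term is meant to penalize.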

Paper Structure

This paper contains 34 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The proposed Embodied-R is a collaborative embodied spatial reasoning framework integrating a Vision-Language Model (VLM) and a Language Model (LM). The separation of perception and reasoning enables us to leverage the perceptual capabilities of large-scale VLMs while training a resource-efficient small-scale LM to activate embodied reasoning through RL. Notably, we introduce a novel logical consistency reward to guide the LM in producing logically coherent reasoning and answers.
  • Figure 2: Case Analysis: Embodied-R shows early signs of slow thinking: it thinks before answering, effectively distinguishes spatial relationships, provides structured and organized responses, and integrates information across multiple frames for embodied scene analysis.
  • Figure 3: Ablation of RL training and comparison to other language models.
  • Figure 4: a-d. The GRPO training process (a: accuracy reward; b: format reward; c: ratio of logical consistency reward to accuracy reward; d: response length of validation set). e. Comparison of accuracy reward curves for RL training of equivalently sized LM and VLM models. f. Model performance before and after integrating logical consistency reward. g. Comparison of generalization performance between models trained with RL and SFT.
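Figure 4 refers to GRPO (Group Relative Policy Optimization) training. As a rough illustration of the group-relative idea, the sketch below normalizes the scalar rewards of several responses sampled for the same question against the group mean and standard deviation. This is the commonly described form of the GRPO advantage, not code from the paper; the use of the population standard deviation, the `eps` value, and the function name are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8  # eps is an assumed stabilizer
) -> List[float]:
    """Normalize each reward against its sampling group: (r - mean) / (std + eps)."""
    if not rewards:
        return []
    mu = mean(rewards)
    # Population standard deviation is an assumption; implementations differ.
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for one question, each scored by a scalar reward.
advantages = group_relative_advantages([3.0, 1.0, 1.0, 3.0])
```

Responses scoring above the group average receive positive advantages and those below receive negative ones. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.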