RynnEC: Bringing MLLMs into Embodied World

Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao

TL;DR

RynnEC tackles the lack of grounded embodied cognition in MLLMs by introducing a region-centric video MLLM with a region encoder and a mask decoder. It leverages an RGB video data-generation pipeline to synthesize object- and spatial-cognition QA data and presents RynnEC-Bench to rigorously evaluate 22 embodied cognitive abilities. Despite a compact 7B size, RynnEC achieves strong object and spatial cognition, outperforming larger proprietary models on multiple metrics, with a 2B variant enabling on-device deployment. This work provides a scalable foundation for embodied intelligence and delivers a practical data-to-benchmark pipeline to generalize across diverse embodied tasks.

Abstract

We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric-video-based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC

Paper Structure

This paper contains 40 sections, 3 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: RynnEC is a video multi-modal large language model (MLLM) specifically designed for embodied cognition tasks. It can accept inputs interwoven from video, region masks, and text, and produce output in the form of text or masks based on the question. RynnEC is capable of addressing a diverse range of object and spatial questions within embodied contexts and plays a significant role in indoor embodied tasks.
  • Figure 2: Embodied Cognition Question-Answer (QA) Data Generation Pipeline: First, objects within the scene are segmented from the video. Subsequently, object and spatial QA pairs are generated via two distinct branches.
  • Figure 3: Overview of embodied cognition dimensions in RynnEC-Bench. RynnEC-Bench includes two subsets: object cognition and spatial cognition, evaluating a total of 22 embodied cognitive abilities.
  • Figure 4: Training paradigm of RynnEC. The model is trained in four progressive stages: 1) Mask Alignment, 2) Object Understanding, 3) Spatial Understanding, and 4) Referring Segmentation.
  • Figure 5: More granular assessments of object cognition and spatial cognition. We compare the best-performing MLLM from each category with our RynnEC-7B.
  • ...and 4 more figures
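
The captions for Figures 1 and 4 describe an interleaved input of video, region masks, and text, and a four-stage training order. The sketch below is one way to picture those two ideas in Python. It is purely illustrative: the class names `VideoClip` and `RegionMask`, the helpers `segment_kinds` and `curriculum`, and all fields are hypothetical and are not taken from the RynnEC codebase; only the four stage names come from the Figure 4 caption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Union


class Stage(Enum):
    """Stage names as listed in the Figure 4 caption, numbered in that order."""
    MASK_ALIGNMENT = 1
    OBJECT_UNDERSTANDING = 2
    SPATIAL_UNDERSTANDING = 3
    REFERRING_SEGMENTATION = 4


@dataclass
class VideoClip:
    """Hypothetical stand-in for a video input (assumption, not the real API)."""
    num_frames: int


@dataclass
class RegionMask:
    """Hypothetical stand-in for a region mask tied to one frame and one object."""
    frame_index: int
    object_id: int


# An interleaved prompt mixes video, region masks, and plain text segments.
Segment = Union[VideoClip, RegionMask, str]


def segment_kinds(prompt: List[Segment]) -> List[str]:
    """Label each segment of an interleaved prompt by its modality."""
    kinds = []
    for seg in prompt:
        if isinstance(seg, VideoClip):
            kinds.append("video")
        elif isinstance(seg, RegionMask):
            kinds.append("mask")
        else:
            kinds.append("text")
    return kinds


def curriculum() -> List[Stage]:
    """Return the training stages in the progressive order from Figure 4."""
    return sorted(Stage, key=lambda stage: stage.value)
```

A prompt such as `[VideoClip(8), "How far is ", RegionMask(0, 1), " from the camera?"]` would then be labelled video, text, mask, text, which mirrors the interleaving described for Figure 1. How the actual model tokenizes and fuses these segments is not specified in this summary.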