RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao
TL;DR
RynnEC addresses the lack of grounded embodied cognition in MLLMs by introducing a region-centric video MLLM with a region encoder and a mask decoder. It uses an RGB-video data-generation pipeline to synthesize object- and spatial-cognition QA data, and presents RynnEC-Bench to evaluate 22 embodied cognitive abilities. Despite its compact 7B size, RynnEC achieves strong object and spatial cognition, outperforming larger proprietary models on multiple metrics; a 2B variant enables on-device deployment. The work provides a scalable foundation for embodied intelligence and a practical data-to-benchmark pipeline aimed at generalizing across diverse embodied tasks.
Abstract
We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric-video-based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at https://github.com/alibaba-damo-academy/RynnEC
