VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
TL;DR
VIKI-Bench is the first hierarchical benchmark for embodied multi-agent cooperation. It covers three levels of visual reasoning (agent activation, task planning, and trajectory perception) and incorporates diverse robot embodiments and multi-view observations. The paper also introduces VIKI-R, a two-stage framework that first learns visual reasoning through Chain-of-Thought supervised fine-tuning and then refines the policy with Group Relative Policy Optimization (GRPO) reinforcement learning under hierarchical rewards. Empirical results show that VIKI-R outperforms strong baselines across all task levels and reveal how hierarchical supervision and RL enable emergent compositional collaboration among heterogeneous agents. Together, the benchmark and the framework provide a unified testbed and a practical learning paradigm for visual-driven, embodied multi-agent cooperation, with potential applications in real-world robotics and automation.
Abstract
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning, and these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained VLM using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
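
To make the reinforcement learning stage more concrete, the following is a minimal sketch of the group-relative advantage computation commonly used in GRPO-style training: several responses are sampled for the same prompt, each is scored, and every score is normalized against the statistics of its own group. This illustrates the general technique only and is not taken from the VIKI-R implementation. The function name `group_relative_advantages`, the `eps` stabilizer, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group.

    `rewards` holds one scalar score per sampled response to the same prompt.
    `eps` (an assumed stabilizer) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so the policy update favors the better samples within each group without requiring a separate learned value function.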
