VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin

TL;DR

VIKI-Bench is the first hierarchical benchmark for embodied multi-agent cooperation. It covers three levels of visual reasoning (agent activation, task planning, and trajectory perception) and incorporates diverse robot embodiments and multi-view observations. The paper also introduces VIKI-R, a two-stage framework that first learns visual reasoning via Chain-of-Thought supervised fine-tuning and then refines the policy with Group Relative Policy Optimization (GRPO) reinforcement learning under hierarchical rewards. Empirical results show that VIKI-R outperforms strong baselines across all task levels and that hierarchical supervision and RL enable emergent compositional collaboration among heterogeneous agents. The work delivers a unified testbed and a practical learning paradigm for advancing visual-driven, embodied multi-agent cooperation, with potential applications in real-world robotics and automation.

Abstract

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained VLM using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

Paper Structure

This paper contains 51 sections, 9 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: Embodied multi-agent cooperation involves two key aspects: (1) cross-embodiment collaboration, where different embodiments are required for different tasks (e.g., washing requires a humanoid, while only wheeled robots can fetch from high cabinets); and (2) efficient coordination, where agents work in parallel (e.g., multiple arms passing apples while a humanoid washes them) to improve overall efficiency. To support such fine-grained teamwork, we propose VIKI-Bench, which structures the process into three levels of visual reasoning: Level 1 – agent activation, Level 2 – task planning, and Level 3 – trajectory perception, aiming to realize an embodied multi-agent system.
  • Figure 2: Overview of VIKI-Bench. VIKI-Bench is a hierarchical benchmark for evaluation on multi-agent embodied cooperation, featuring visual reasoning tasks in three levels: (1) Agent Activation, where robots are selected based on the scene image and the task context; (2) Task Planning, where a structured multi-agent action plan is generated, verified, and refined; and (3) Trajectory Perception, where the fine-grained motion trajectory of each agent is tracked from egocentric views. The benchmark involves diverse robot types and complex 3D environments, with multiple metrics for quantitative evaluation.
  • Figure 3: Framework of VIKI-R. We adopt supervised fine-tuning (SFT) and reinforcement fine-tuning on the VIKI dataset, incorporating format and accuracy rewards to optimize the policy model.
  • Figure 4: Response length of the Qwen2.5-VL-3B/7B-Instruct model at training time.
  • Figure 5: Mean GRPO reward curve for the VIKI-L1 task.
  • ...and 3 more figures
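
The Figure 3 caption mentions format and accuracy rewards used during reinforcement fine-tuning. The sketch below illustrates, under stated assumptions, how such rewards might be combined and turned into group-normalized advantages of the kind GRPO uses (each sampled response is scored relative to the mean and standard deviation of its own group). The `<think>`/`<answer>` tag layout, the exact-match accuracy check, the `format_weight` value, and all function names are hypothetical choices made for illustration; they are not taken from the paper, whose level-specific rewards may differ.

```python
import re
from statistics import mean, pstdev


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed
    <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def accuracy_reward(response: str, target: str) -> float:
    """Return 1.0 if the text inside <answer> exactly matches the target.
    Exact match is a stand-in for a task-specific correctness check."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return 1.0 if match and match.group(1).strip() == target else 0.0


def total_reward(response: str, target: str, format_weight: float = 0.5) -> float:
    """Weighted sum of the two rewards; the weight is an assumed value."""
    return format_weight * format_reward(response) + accuracy_reward(response, target)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of sampled responses:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of responses sampled for the same (hypothetical) prompt.
group = [
    "<think>The cabinet is high.</think><answer>wheeled robot</answer>",
    "<think>Use the arm.</think><answer>robot arm</answer>",
    "wheeled robot",  # right idea, but missing the required layout
]
rewards = [total_reward(r, target="wheeled robot") for r in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a response is reinforced only when it scores above its siblings for the same prompt; if every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.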