CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks
Seoyeon Choi, Kanghyun Ryu, Jonghoon Ock, Negar Mehr
TL;DR
CRAFT introduces a coach-like framework that leverages foundation models (LLMs and VLMs) to automatically generate curricula for long-horizon multi-robot coordination tasks, write executable reward functions for each subtask, and iteratively refine those rewards through a VLM-guided loop. The method decomposes target tasks into subtasks, trains decentralized policies under centralized training with decentralized execution (CTDE), and uses VLM-based evaluation of the learned behavior to guide reward refinement, enabling coordination behaviors in multi-quadruped navigation and bimanual manipulation. Across simulation and hardware, CRAFT outperforms baselines that use only environment rewards or no curriculum, demonstrating the value of automated curriculum design and reward shaping for complex multi-agent reinforcement learning (MARL) in robotics. The work also identifies the stochasticity of foundation-model outputs as a limitation and points to improving the reliability and stability of foundation-model coaching for robot learning as future work.
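CTDE means each robot's policy acts only on its own local observation at execution time, while a critic used during training may see the joint observations and actions of all robots. The sketch below illustrates only that information structure. It assumes simple linear actors and a linear critic; the names `LinearActor` and `CentralCritic` are hypothetical, and this is not CRAFT's actual network architecture or MARL algorithm, which this summary does not specify.

```python
import random
from typing import List, Sequence


class LinearActor:
    """Decentralized actor (hypothetical): maps ONE agent's local observation to its action."""

    def __init__(self, obs_dim: int, act_dim: int, seed: int) -> None:
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(obs_dim)] for _ in range(act_dim)]

    def act(self, local_obs: Sequence[float]) -> List[float]:
        # Linear policy a = W o, computed from this agent's own observation only.
        return [sum(wi * oi for wi, oi in zip(row, local_obs)) for row in self.w]


class CentralCritic:
    """Centralized critic (hypothetical): used only in training, sees all agents' obs and actions."""

    def __init__(self, n_agents: int, obs_dim: int, act_dim: int, seed: int) -> None:
        rng = random.Random(seed)
        self.in_dim = n_agents * (obs_dim + act_dim)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(self.in_dim)]

    def value(self, joint_obs: Sequence[Sequence[float]],
              joint_act: Sequence[Sequence[float]]) -> float:
        # Concatenate every agent's observation, then every agent's action.
        x = [v for o in joint_obs for v in o] + [v for a in joint_act for v in a]
        if len(x) != self.in_dim:
            raise ValueError(f"expected {self.in_dim} inputs, got {len(x)}")
        return sum(wi * xi for wi, xi in zip(self.w, x))


# Usage: two agents (e.g., two quadrupeds) with 4-D local observations and 2-D actions.
N, OBS, ACT = 2, 4, 2
actors = [LinearActor(OBS, ACT, seed=i) for i in range(N)]
critic = CentralCritic(N, OBS, ACT, seed=42)

obs = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -1.0, 0.0]]
actions = [actor.act(o) for actor, o in zip(actors, obs)]  # decentralized execution
q = critic.value(obs, actions)                              # centralized training signal
```

The point of the split is that only the actors are deployed on the robots; the critic's access to joint information helps stabilize learning against the non-stationarity caused by other agents also learning.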
Abstract
Multi-Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi-agent systems. However, applying MARL to robotics remains challenging due to high-dimensional continuous joint action spaces, complex reward design, and the non-stationary transitions inherent to decentralized settings. Humans, by contrast, learn complex coordination through staged curricula, in which long-horizon behaviors are progressively built upon simpler skills. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for multi-robot coordination Tasks, a framework that leverages the reasoning capabilities of foundation models to act as a "coach" for multi-robot coordination. CRAFT automatically decomposes long-horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). CRAFT then trains each subtask with reward functions generated by an LLM and refines them through a Vision Language Model (VLM)-guided reward-refinement loop. We evaluate CRAFT on multi-quadruped navigation and bimanual manipulation tasks, demonstrating that it can learn complex coordination behaviors. In addition, we validate the multi-quadruped navigation policy in real-world hardware experiments.
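The staged pipeline described in the abstract (LLM task decomposition, LLM-generated rewards, VLM-guided reward refinement) can be read as a control loop. The following is a hedged sketch of that loop only. The callables `decompose`, `write_reward`, `train`, `evaluate`, and `refine`, the `max_refinements` budget, and the `Subtask`/`RewardSpec` types are all hypothetical placeholders for LLM/VLM queries and MARL training; warm-starting each stage from the previous stage's policy is an assumption about how later skills build on earlier ones, not a confirmed detail of CRAFT.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Subtask:  # hypothetical: one stage of the curriculum proposed by the LLM planner
    name: str
    description: str


@dataclass
class RewardSpec:  # hypothetical stand-in for LLM-written reward code
    subtask: str
    version: int
    weights: Dict[str, float] = field(default_factory=dict)


def coach(task: str,
          decompose: Callable[[str], List[Subtask]],                  # LLM planner (assumed)
          write_reward: Callable[[Subtask], RewardSpec],              # LLM reward writer (assumed)
          train: Callable[[Subtask, RewardSpec, Optional[Any]], Any], # MARL training (assumed)
          evaluate: Callable[[Subtask, Any], Tuple[bool, str]],       # VLM judge (assumed)
          refine: Callable[[RewardSpec, str], RewardSpec],            # LLM reward revision (assumed)
          max_refinements: int = 3):
    policy: Optional[Any] = None
    log: List[Tuple[str, int, bool]] = []
    for subtask in decompose(task):
        reward = write_reward(subtask)
        for attempt in range(max_refinements + 1):
            # Assumption: warm-start from the policy learned in the previous stage.
            candidate = train(subtask, reward, policy)
            success, feedback = evaluate(subtask, candidate)
            log.append((subtask.name, reward.version, success))
            if success or attempt == max_refinements:
                break
            reward = refine(reward, feedback)  # revise the reward using VLM feedback
        policy = candidate  # carry the last trained policy into the next stage
    return policy, log


# Toy stubs (entirely hypothetical) so the loop runs end to end.
def toy_decompose(task):
    return [Subtask("approach", "move both robots toward the goal region"),
            Subtask("pass", "cross the goal region one robot at a time")]

def toy_write_reward(sub):
    return RewardSpec(sub.name, version=0, weights={"progress": 1.0})

def toy_train(sub, reward, prev):
    # A "policy" here is just the history of (subtask, reward version) it was trained on.
    return (prev or ()) + ((sub.name, reward.version),)

def toy_evaluate(sub, policy):
    ok = policy[-1][1] >= 1  # pretend the VLM accepts only the refined reward
    return ok, "" if ok else "robots collide; add a proximity penalty"

def toy_refine(reward, feedback):
    return RewardSpec(reward.subtask, reward.version + 1, {**reward.weights, "collision": -1.0})


policy, log = coach("toy two-robot navigation task", toy_decompose, toy_write_reward,
                    toy_train, toy_evaluate, toy_refine)
```

With these stubs, each subtask fails once with its initial reward, passes after one refinement, and the final policy records both stages in order.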
