CRAFT: Coaching Reinforcement Learning Autonomously using Foundation Models for Multi-Robot Coordination Tasks

Seoyeon Choi, Kanghyun Ryu, Jonghoon Ock, Negar Mehr

TL;DR

CRAFT is a coach-like framework that uses foundation models (LLMs and VLMs) to automatically generate curricula for long-horizon multi-robot coordination tasks, design executable reward functions for each subtask, and iteratively refine those rewards through a VLM-guided loop. The method decomposes a target task into subtasks, trains decentralized policies under centralized training with decentralized execution (CTDE), and uses VLM-based evaluation of the trained policies to guide reward refinement, enabling coordination behaviors in multi-quadruped navigation and bimanual manipulation. CRAFT outperforms baselines that rely on environment rewards or use no curriculum, and the multi-quadruped navigation policy is also validated on real hardware. The work identifies the stochasticity of foundation-model outputs as a limitation, which points to improving the reliability and stability of foundation-model coaching as future work.

Abstract

Multi-Agent Reinforcement Learning (MARL) provides a powerful framework for learning coordination in multi-agent systems. However, applying MARL to robotics remains challenging due to high-dimensional continuous joint action spaces, complex reward design, and non-stationary transitions inherent to decentralized settings. Humans, by contrast, learn complex coordination through staged curricula, in which long-horizon behaviors are progressively built upon simpler skills. Motivated by this, we propose CRAFT: Coaching Reinforcement learning Autonomously using Foundation models for multi-robot coordination Tasks, a framework that leverages the reasoning capabilities of foundation models to act as a "coach" for multi-robot coordination. CRAFT automatically decomposes long-horizon coordination tasks into sequences of subtasks using the planning capability of Large Language Models (LLMs). CRAFT then trains each subtask using reward functions generated by an LLM and refines them through a Vision Language Model (VLM)-guided reward-refinement loop. We evaluate CRAFT on multi-quadruped navigation and bimanual manipulation tasks, demonstrating its capability to learn complex coordination behaviors. In addition, we validate the multi-quadruped navigation policy in real hardware experiments.
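
The pipeline described in the abstract (decompose the task, train each subtask with an LLM-generated reward, refine the reward when evaluation fails) can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: every callable below (`propose_curriculum`, `generate_reward`, `train_policy`, `evaluate`, `advise`) is a hypothetical stand-in for a foundation-model call or a MARL training run, and the choices to warm-start each subtask from the previously accepted policy and to cap refinements at `max_refinements` are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Subtask:
    name: str
    description: str


def coach(
    task: str,
    propose_curriculum: Callable[[str], List[Subtask]],       # stand-in: curriculum LLM
    generate_reward: Callable[[Subtask, Optional[str]], str],  # stand-in: reward-writing LLM
    train_policy: Callable[[Subtask, str, Any], Any],          # stand-in: MARL training run
    evaluate: Callable[[Subtask, Any], bool],                  # stand-in: evaluation VLM
    advise: Callable[[Subtask, str, Any], str],                # stand-in: advice VLM
    max_refinements: int = 3,                                  # assumed cap, not from the paper
) -> Tuple[Any, List[Tuple[str, int, bool]]]:
    if max_refinements < 1:
        raise ValueError("max_refinements must be at least 1")

    policy = None  # last accepted policy; each subtask warm-starts from it (assumption)
    history: List[Tuple[str, int, bool]] = []

    for subtask in propose_curriculum(task):
        advice: Optional[str] = None
        candidate = None
        for attempt in range(1, max_refinements + 1):
            # The reward is regenerated with the latest advice, if any.
            reward = generate_reward(subtask, advice)
            candidate = train_policy(subtask, reward, policy)
            success = evaluate(subtask, candidate)
            history.append((subtask.name, attempt, success))
            if success:
                break
            # On failure, ask for advice to guide the next reward revision.
            advice = advise(subtask, reward, candidate)
        # Carry the last candidate forward even if no attempt passed (assumption).
        policy = candidate

    return policy, history
```

Keeping the model calls behind plain callables lets the loop be exercised with deterministic stubs, which matters given that foundation-model outputs are stochastic.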

Paper Structure

This paper contains 26 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Example of curriculum refinement for the task of lifting and balancing the pot. Three candidate curricula, $\mathcal{C}^1$ to $\mathcal{C}^3$, generated by the curriculum LLM, are passed back to the LLM for refinement. In $\mathcal{C}^1$, Task 1 focuses only on minimizing distance, while Task 1 in $\mathcal{C}^3$ is defined as minimizing distance and matching orientation. In contrast, Task 3 and Task 4 in $\mathcal{C}^1$ split the lift into two stages, first to half height and then to full height, whereas $\mathcal{C}^3$ represents lifting as a single task. The curriculum LLM merges these candidates into a final curriculum $\mathcal{C}$ by selecting the stronger task definitions from each candidate.
  • Figure 2: Example of reward refinement of subtask Coordinate Preliminary Lift. Through the first reward-refinement loop, $R^1_{k=3}$ was produced and the evaluation VLM marked the policy as a failure since the pot never reached the required elevation of 0.05 m. The reward component learning curves were then passed to the advice VLM, which identified that lift_reward was too weak compared to balance_reward. It recommended removing the square on elevation, increasing the lift weight, and decreasing the balance weight. The revised reward $R^2_{k=3}$ reflects these changes: the square on elevation was removed, the lift weight increased from 80 to 200, and the balance weight decreased from 2 to 1. With this reward, the policy successfully achieved the 0.05 m elevation and satisfied the subtask.
  • Figure 3: Illustrative snapshot showing successful execution of multi-agent coordination tasks by a CRAFT-trained policy.
  • Figure 4: Illustrative snapshot of policies trained with env_reward without a curriculum. The policies show suboptimal behaviors, such as only one agent passing the gate, or the agents only grasping the pot rather than lifting it.
  • Figure 5: Success rate of the top three curricula in each environment. Each environment is evaluated on 100 random initial conditions. CRAFT achieves the highest success rate in every environment, demonstrating its ability to learn complex coordination tasks that are challenging to learn without a curriculum or well-crafted reward functions.
  • ...and 1 more figure
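
The reward change described in the Figure 2 caption can be checked with a little arithmetic. Only the lift weights (80 and 200), the removal of the square on elevation, and the 0.05 m target come from the caption; the function names and the exact functional form (weight times elevation, or weight times squared elevation) are assumptions made for illustration. The balance term is left out because the caption does not give its form.

```python
def lift_reward_before(elevation_m: float, weight: float = 80.0) -> float:
    """Assumed pre-refinement lift term: weight times squared elevation."""
    return weight * elevation_m ** 2


def lift_reward_after(elevation_m: float, weight: float = 200.0) -> float:
    """Assumed post-refinement lift term: square removed, weight raised."""
    return weight * elevation_m


target_m = 0.05  # required elevation stated in the Figure 2 caption
before = lift_reward_before(target_m)
after = lift_reward_after(target_m)
print(f"lift term at {target_m} m: before={before:.3f}, after={after:.3f}")
# prints: lift term at 0.05 m: before=0.200, after=10.000
```

Under these assumed forms, squaring an elevation well below 1 m shrinks the lift signal sharply: at the target height the revised term is 50 times larger. This is consistent with the advice VLM's diagnosis that lift_reward was too weak compared to balance_reward.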