Learning to Coordinate with Experts

Mohamad H. Danesh, Nguyen X. Khanh, Tu Trinh, Benjamin Plaut

TL;DR

This work formalizes Yield-or-Request Control (YRC-0), an unsupervised learning-to-coordinate-with-experts problem, and introduces YRC-Bench, a robust, open benchmark spanning MiniGrid, Procgen, and CLIPort to study cross-environment generalization. It proposes a soft-constraint reward with interpretable trade-offs via a parameter $\alpha$ and an AUC-based evaluation to compare methods across multiple query costs, complemented by a simulated validator (RLOracle) to guide policy selection without access to the true test distribution. Through a large-scale study of 2,600 policies across 19 environments, the authors show there is no universally best method, reveal substantial room for improvement, and argue that current gains are bottlenecked by narrow policy spaces rather than validation quality. The work provides practical recommendations and a rigorous benchmark to spur development of more expressive coordination policies and validation strategies for safe, generalizable human-AI collaboration.
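The AUC-based evaluation mentioned above can be illustrated with a small sketch. It assumes each policy is summarized by its average return at several values of $\alpha$, that the area is taken with the trapezoidal rule, and that the mean and standard deviation are computed across random seeds. The function names and the choice of integration rule are illustrative assumptions, not taken from the YRC-Bench codebase.

```python
from statistics import mean, stdev


def trapezoid_auc(xs, ys):
    """Area under the piecewise-linear curve through the points (xs[i], ys[i])."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) points of equal length")
    pairs = sorted(zip(xs, ys))  # integrate left to right along the x axis
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area


def auc_over_seeds(alphas, returns_per_seed):
    """Mean and sample standard deviation of the AUC across seeds.

    alphas: the alpha values at which a policy was evaluated (assumed grid).
    returns_per_seed: one list of average returns per seed, aligned with alphas.
    """
    aucs = [trapezoid_auc(alphas, returns) for returns in returns_per_seed]
    spread = stdev(aucs) if len(aucs) > 1 else 0.0
    return mean(aucs), spread


# Hypothetical numbers, for illustration only (not results from the paper).
alphas = [0.0, 0.5, 1.0]
returns_per_seed = [
    [1.0, 0.8, 0.6],  # seed 0
    [0.9, 0.7, 0.5],  # seed 1
]
auc_mean, auc_std = auc_over_seeds(alphas, returns_per_seed)
```

Collapsing the return-versus-$\alpha$ curve into one number lets methods be compared across several query costs at once, rather than at a single hand-picked cost.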

Abstract

When deployed in the real world, AI agents will inevitably face challenges that exceed their individual capabilities. Leveraging assistance from experts, whether humans or highly capable AI systems, can significantly improve both safety and performance in such situations. Since expert assistance is costly, a central challenge is determining when to consult an expert. In this paper, we explore a novel variant of this problem, termed YRC-0, in which an agent must learn to collaborate with an expert in new environments in an unsupervised manner, that is, without interacting with the expert during training. This setting motivates the development of low-cost, robust approaches for training expert-leveraging agents. To support research in this area, we introduce YRC-Bench, an open-source benchmark that instantiates YRC-0 across diverse environments. YRC-Bench provides a standardized Gym-like API, simulated experts, an evaluation pipeline, and implementations of popular baselines. Toward tackling YRC-0, we propose a validation strategy and evaluate a range of learning methods, offering insights that can inform future research. Codebase: github.com/modanesh/YRC-Bench

Paper Structure

This paper contains 41 sections, 3 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Illustration of the YRC-0 problem. Left: an agent learns tasks on its own (e.g., RL with environment rewards). Right: at test time, it has to perform novel tasks with the help of an expert. While learning in isolation, how can the agent develop a collaboration strategy that will be effective at test time?
  • Figure 2: Simulation of the YRC-0 problem. A coordination environment encapsulates two policies: novice and expert. A coordination policy decides which policy will be used to generate the next action. The coordination policy's decision is then translated into an environment action and executed.
  • Figure 3: (a) Training (top) and test (bottom) tasks in DoorKey (Minigrid), CoinRun (Procgen), and stack-block-pyramid (CLIPort). (b) The generalization gaps of the novice: its average return on test tasks, normalized by the average return of the expert. (c) To evaluate policies, we compute the area under the curve defined by the average return at varying values of $\alpha$, and report its mean and standard deviation across random seeds.
  • Figure 4: Number of environments where a method attains the highest mean AUC; solid bars denote use of our validation method.
  • Figure 5: Test performance of learning methods across environments, normalized by the performance of the best RLOracle method. For each environment, we show three variants: the best performing method with simulated validation (i.e., excluding AlwaysNovice, AlwaysExpert, and AlwaysRandom0.5), the same method with an oracle validator (+oracle validator), and the same method with an oracle proposer, which employs a deep RL approach (+oracle proposer). The gaps between the latter two variants and the original indicate potential performance gains that could be achieved by improving the replaced components. Error bars represent $2\times$ standard deviation.
  • ...and 7 more figures
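
The coordination environment described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the YRC-Bench API: `ToyChainEnv`, `CoordinationEnv`, `run_episode`, and the flat per-query cost `query_cost` are all hypothetical, and the cost is only a stand-in for the paper's soft-constraint reward, whose exact form is not given here.

```python
import random

NOVICE, EXPERT = 0, 1  # the two decisions available to a coordination policy


class ToyChainEnv:
    """Hypothetical task: start at cell 0 and reach cell `length` by moving +1 or -1."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.pos = 0
        self.t = 0

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos + action)
        self.t += 1
        reached = self.pos >= self.length
        done = reached or self.t >= self.max_steps
        return self.pos, (1.0 if reached else 0.0), done


class CoordinationEnv:
    """Wraps a base environment together with a novice and an expert policy."""

    def __init__(self, base_env, novice, expert, query_cost):
        self.base_env = base_env
        self.novice = novice
        self.expert = expert
        self.query_cost = query_cost  # assumed flat cost per expert query
        self.obs = None

    def reset(self):
        self.obs = self.base_env.reset()
        return self.obs

    def step(self, decision):
        # Translate the coordination decision into an environment action.
        policy = self.expert if decision == EXPERT else self.novice
        action = policy(self.obs)
        self.obs, reward, done = self.base_env.step(action)
        if decision == EXPERT:
            reward -= self.query_cost  # assumption: querying the expert is penalized
        return self.obs, reward, done


def run_episode(env, coordination_policy):
    """Roll out one episode and return the cost-adjusted total reward."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(coordination_policy(obs))
        total += reward
    return total


def always_novice(obs):
    return NOVICE


def always_expert(obs):
    return EXPERT


rng = random.Random(0)
env = CoordinationEnv(
    ToyChainEnv(),
    novice=lambda obs: rng.choice([-1, 1]),  # weak policy: random walk
    expert=lambda obs: 1,                    # strong policy: always moves right
    query_cost=0.1,
)
novice_return = run_episode(env, always_novice)
expert_return = run_episode(env, always_expert)
```

The fixed policies `always_novice` and `always_expert` mirror the AlwaysNovice and AlwaysExpert baselines named in the Figure 5 caption. A learned coordination policy would replace them with a function of the observation.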