Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, Diyi Yang

TL;DR

Collaborative Gym (Co-Gym) introduces a dual-control, non-turn-taking framework for human–LM agent collaboration within shared task environments, plus an evaluation suite that tracks both outcomes and collaboration processes. It implements a flexible task-environment interface, a collaboration protocol, and Redis-backed notifications to support asynchronous interaction. Across Travel Planning, Related Work, and Tabular Analysis, experiments show that collaborative agents, especially those with situational planning, outperform fully autonomous baselines in real-user trials, while revealing persistent communication and situational-awareness challenges. The framework is open source under the MIT license, enabling broad experimentation and targeted improvements in human–AI teamwork. Co-Gym thus provides a principled platform to study when and how human-in-the-loop collaboration yields tangible benefits and how to mitigate current LM limitations.

Abstract

While the advancement of large language models has spurred the development of AI agents to automate tasks, numerous use cases inherently require agents to collaborate with humans due to humans' latent preferences, domain expertise, or the need for control. To facilitate the study of human-agent collaboration, we introduce Collaborative Gym (Co-Gym), an open framework for developing and evaluating collaborative agents that engage in bidirectional communication with humans while interacting with task environments. We describe how the framework enables the implementation of new task environments and coordination between humans and agents through a flexible, non-turn-taking interaction paradigm, along with an evaluation suite that assesses both collaboration outcomes and processes. Our framework provides both a simulated condition with a reliable user simulator and a real-world condition with an interactive web application. Initial benchmark experiments across three representative tasks -- creating travel plans, writing related work sections, and analyzing tabular data -- demonstrate the benefits of human-agent collaboration: The best-performing collaborative agents consistently outperform their fully autonomous counterparts in task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. Despite these improvements, our evaluation reveals persistent limitations in current language models and agents, with communication and situational awareness failures observed in 65% and 40% of cases in the real condition, respectively. Released under the permissive MIT license, Co-Gym supports the addition of new task environments and can be used to develop collaborative agent applications, while its evaluation suite enables assessment and improvement of collaborative agents.

Paper Structure

This paper contains 33 sections, 20 figures, 10 tables.

Figures (20)

  • Figure 1: Collaborative Gym (Co-Gym) enables collaboration between humans and LM agents within a task environment. Left: Human adds requests and sends multiple messages without waiting for agent responses. Right: Human rates collaboration highly as the agent proactively seeks help when uncertain about package installation.
  • Figure 2: Overview of Co-Gym framework. The task environment interface (CoEnv) requires specifying the task description, action space, and observation space (see the task environment section). The collaboration acts and notification protocol (see the asynchronous interaction section) are shared across tasks. For example, when the agent updates the public component, both parties are notified with the new observation (blue solid lines); parties can coordinate by sending messages (green dashed lines).
  • Figure 3: Illustration of Co-Gym (Simulated). The human is simulated by a language model using hidden information associated with each task case. The hidden information is not visible to the LM agent.
  • Figure 4: Screenshots of the interactive web application for Co-Gym (Real).
  • Figure 5: The workflow of Collaborative Agent with Situational Planning to process a notification received by the event loop in AgentNode.
  • ...and 15 more figures
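
The interaction pattern described in the Figure 1 and Figure 2 captions can be illustrated with a small sketch. This is a toy model, not the Co-Gym API: the names `ToyCoEnv`, `Notification`, `update_public`, and `send_message` are hypothetical, and delivering messages only to the other party is an assumption made for the example. It shows two behaviors from the captions: an update to the public component notifies both parties, and a party can send several messages in a row without waiting for a reply.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Notification:
    """One event delivered to a party (hypothetical structure)."""
    kind: str      # "observation" or "message"
    sender: str
    payload: str


@dataclass
class ToyCoEnv:
    """Toy stand-in for a shared task environment; not the Co-Gym API."""
    task_description: str
    parties: List[str]
    public_component: str = ""
    inboxes: Dict[str, List[Notification]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every party gets its own inbox of pending notifications.
        self.inboxes = {party: [] for party in self.parties}

    def update_public(self, sender: str, new_value: str) -> None:
        # A change to the public component notifies every party,
        # including the one who made the change.
        self.public_component = new_value
        for party in self.parties:
            self.inboxes[party].append(
                Notification("observation", sender, new_value)
            )

    def send_message(self, sender: str, text: str) -> None:
        # Assumption: a message goes to every party except the sender.
        # There is no turn check, so repeated sends are allowed.
        for party in self.parties:
            if party != sender:
                self.inboxes[party].append(
                    Notification("message", sender, text)
                )


env = ToyCoEnv("Plan a 3-day trip", parties=["human", "agent"])
env.send_message("human", "Prefer trains over flights.")
env.send_message("human", "Budget is flexible.")  # sent without waiting for a reply
env.update_public("agent", "Day 1: arrive by train")
```

After these calls the agent's inbox holds two messages followed by one observation, while the human's inbox holds only the observation. A real deployment would replace the in-memory inboxes with an asynchronous notification backend.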