Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration
Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, Diyi Yang
TL;DR
Collaborative Gym (Co-Gym) is a dual-control, non-turn-taking framework for human–LM agent collaboration in shared task environments, paired with an evaluation suite that tracks both collaboration outcomes and processes. It provides a flexible task-environment interface, a collaboration protocol, and Redis-backed notifications to support asynchronous interaction. Across Travel Planning, Related Work, and Tabular Analysis, the best collaborative agents, especially those with situational planning, outperform fully autonomous baselines in real-user trials (win rates of 86%, 66%, and 74%, respectively), though communication and situational-awareness failures still occur in 65% and 40% of real-condition cases. The framework is released under the MIT license, so new task environments and agents can be added freely. Co-Gym thus offers a platform for studying when human-in-the-loop collaboration yields tangible benefits and how to mitigate current LM limitations.
Abstract
While the advancement of large language models has spurred the development of AI agents to automate tasks, numerous use cases inherently require agents to collaborate with humans due to humans' latent preferences, domain expertise, or the need for control. To facilitate the study of human-agent collaboration, we introduce Collaborative Gym (Co-Gym), an open framework for developing and evaluating collaborative agents that engage in bidirectional communication with humans while interacting with task environments. We describe how the framework enables the implementation of new task environments and coordination between humans and agents through a flexible, non-turn-taking interaction paradigm, along with an evaluation suite that assesses both collaboration outcomes and processes. Our framework provides both a simulated condition with a reliable user simulator and a real-world condition with an interactive web application. Initial benchmark experiments across three representative tasks -- creating travel plans, writing related work sections, and analyzing tabular data -- demonstrate the benefits of human-agent collaboration: The best-performing collaborative agents consistently outperform their fully autonomous counterparts in task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. Despite these improvements, our evaluation reveals persistent limitations in current language models and agents, with communication and situational awareness failures observed in 65% and 40% of cases in the real condition, respectively. Released under the permissive MIT license, Co-Gym supports the addition of new task environments and can be used to develop collaborative agent applications, while its evaluation suite enables assessment and improvement of collaborative agents.
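To make the non-turn-taking interaction paradigm described above concrete, the following is a minimal sketch, not Co-Gym's actual API. The class `SharedTaskEnv`, its methods, and the action names `edit` and `send_message` are hypothetical, and in-process callbacks stand in for whatever notification mechanism a real deployment would use. It only illustrates the idea that either party may act on the shared environment at any time, with the other party notified of each change.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedTaskEnv:
    """Hypothetical shared workspace that either party may act on at any time."""

    state: Dict[str, str] = field(default_factory=dict)
    subscribers: Dict[str, Callable[[dict], None]] = field(default_factory=dict)
    log: List[dict] = field(default_factory=list)

    def subscribe(self, role: str, callback: Callable[[dict], None]) -> None:
        # Register a callback that receives events produced by other parties.
        self.subscribers[role] = callback

    def step(self, role: str, action: str, payload: dict) -> dict:
        # No turn check here: any role may call step() at any moment.
        if action == "edit":
            self.state[payload["key"]] = payload["value"]
        elif action != "send_message":
            raise ValueError(f"unknown action: {action}")
        event = {"role": role, "action": action, "payload": payload}
        self.log.append(event)
        # Notify every party except the one that acted.
        for other, callback in self.subscribers.items():
            if other != role:
                callback(event)
        return event


env = SharedTaskEnv()
human_inbox: List[dict] = []
agent_inbox: List[dict] = []
env.subscribe("human", human_inbox.append)
env.subscribe("agent", agent_inbox.append)

# The agent acts twice in a row, then the human overrides the shared draft.
env.step("agent", "edit", {"key": "day_1", "value": "Museum visit"})
env.step("agent", "send_message", {"text": "Do you prefer indoor activities?"})
env.step("human", "edit", {"key": "day_1", "value": "Hiking"})
```

In this toy run the human receives two notifications and the agent one, and the shared state reflects the human's latest edit. Keeping a full event log is one simple way to support process-level evaluation alongside outcome-level evaluation.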
