CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System

Zexin Lin; Jiachen Yu; Haoyang Zhang; Yuzhao Li; Zhonghang Li; Yujiu Yang; Junjie Wang; Xiaoqiang Ji

TL;DR

CoWork-X tackles real-time coordination and continual cross-episode adaptation under strict online budgets by introducing an Execute--Optimize loop that separates fast, HTN-based execution from offline, patch-style skill updates. A shared HTN skill library $\mathcal{S}_k$ is incrementally improved by a post-episode Co-Optimizer using episode logs, enabling stable multi-agent co-evolution. In Overcooked-like benchmarks, CoWork-X achieves sustained gains with zero online tokens and markedly lower latency ($\approx 2.6$ s per episode) compared to baselines that rely on frequent in-episode LLM reasoning, while generalizing across multiple LLM backbones. The work demonstrates practical, scalable cross-episode collaboration and highlights the value of log-grounded, verifier-driven skill consolidation for real-time multi-agent systems.

Abstract

Large language models are enabling language-conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing approaches either rely on frequent in-episode reasoning that induces latency and timing jitter, or deliver post-episode improvements through unstructured text that is difficult to compile into reliable low-cost execution. We propose CoWork-X, an active co-evolution framework that casts peer collaboration as a closed-loop optimization problem across episodes, inspired by fast--slow memory separation. CoWork-X instantiates a Skill-Agent that executes via hierarchical task network (HTN)-based skill retrieval from a structured, interpretable, and compositional skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked-AI-like real-time collaboration benchmarks demonstrate that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.
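The split between fast execution and slow, patch-style consolidation can be illustrated with a minimal Python sketch. Everything named here (`Skill`, `Patch`, `select_skill`, `apply_patches`, the state facts, and the patch format) is a hypothetical stand-in rather than the paper's implementation; the LLM Co-Optimizer is not modeled, and a simple cap on the number of patches stands in for the budget constraint and drift regularization.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Skill:
    """One HTN-style method: applicable when all preconditions hold (assumed format)."""
    name: str
    preconditions: FrozenSet[str]
    subtasks: Tuple[str, ...]  # primitive actions, executed in order


@dataclass(frozen=True)
class Patch:
    """A small edit to one skill: add preconditions (hypothetical patch format)."""
    skill: str
    add_preconditions: FrozenSet[str]


SkillLibrary = Dict[str, Skill]


def select_skill(library: SkillLibrary, state: FrozenSet[str]) -> Optional[Skill]:
    """Fast path: return the first skill whose preconditions hold. No LLM call."""
    for skill in library.values():
        if skill.preconditions <= state:
            return skill
    return None


def apply_patches(library: SkillLibrary, patches: List[Patch], budget: int) -> SkillLibrary:
    """Slow path: apply at most `budget` patches, returning S_{k+1} and leaving S_k intact."""
    updated = dict(library)
    for patch in patches[:budget]:
        if patch.skill not in updated:
            continue  # ignore patches that target unknown skills
        old = updated[patch.skill]
        updated[patch.skill] = replace(
            old, preconditions=old.preconditions | patch.add_preconditions
        )
    return updated


# Episode k: the skill fires as soon as the agent holds a dish, even if no soup is ready.
s_k = {
    "serve_soup": Skill(
        "serve_soup", frozenset({"holding_dish"}), ("goto_pot", "plate_soup", "deliver")
    )
}
state = frozenset({"holding_dish"})
fired_before = select_skill(s_k, state)

# Post-episode: a patch (here hand-written, in the paper produced from logs) adds a precondition.
s_k1 = apply_patches(s_k, [Patch("serve_soup", frozenset({"soup_ready"}))], budget=1)
fired_after = select_skill(s_k1, state)
fired_when_ready = select_skill(s_k1, state | {"soup_ready"})
```

In this sketch the library is only ever changed between episodes, so in-episode selection stays a cheap set comparison; that is the property the abstract attributes to separating execution from consolidation.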

Paper Structure (28 sections, 11 figures, 5 tables)

Introduction
Related Work
Real-Time Interactive Collaboration
Slow Adaptation in Multi-Agent Systems
The CoWork-X Framework
Problem Formulation
Overview: Execute--Optimize Closed Loop
Skill Agent: HTN-Executable Skill Memory
Co-Optimizer: Skill Consolidation and Iterative Updates
Experimental Settings
Environment and Task Configuration
Baseline Methods
Evaluation Metrics
Implementation Details
Results and Analysis
...and 13 more sections

Figures (11)

Figure 1: CoWork-X overview. Skill-Agents execute via a shared skill library, and an LLM Co-Optimizer updates it from episode logs for closed-loop co-evolution.
Figure 2: CoWork-X Execute--Optimize loop. A Skill-Agent executes an HTN policy from $\mathcal{S}_k$, then an LLM Co-Optimizer diagnoses episode logs and patches $\mathcal{S}_k\!\to\!\mathcal{S}_{k+1}$ (e.g., adding preconditions), improving performance in subsequent episodes.
Figure 3: Performance across 30 episodes. CoWork-X shows consistent improvement across iterations. Baselines show unstable performance from frequent online LLM calls.
Figure 4: Score efficiency. CoWork-X achieves higher score-per-resource ratios: 0.92 score/s and 5.9 score/1k tokens, versus DPT-WToM's 0.09 and 0.20.
Figure 5: Ablation study across model families. Each model was evaluated in 5 independent runs of 10 iterations each.
...and 6 more figures
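
The score-efficiency ratios quoted in the Figure 4 caption are plain quotients, and the relative advantage they imply can be checked with a short calculation. The helper `score_efficiency` and its example inputs are illustrative assumptions, not values from the paper; only the four ratios in the last lines come from the caption, which does not say which tokens are counted.

```python
from typing import Tuple


def score_efficiency(score: float, wall_seconds: float, tokens: int) -> Tuple[float, float]:
    """Return (score per second, score per 1k tokens) for one run."""
    if wall_seconds <= 0 or tokens <= 0:
        raise ValueError("wall_seconds and tokens must be positive")
    return score / wall_seconds, score / (tokens / 1000)


# Made-up inputs, only to show the units: 120 points in 60 s using 4000 tokens.
per_second, per_1k_tokens = score_efficiency(120.0, 60.0, 4000)

# Ratios reported in the Figure 4 caption (CoWork-X vs. DPT-WToM).
time_advantage = 0.92 / 0.09   # roughly 10x more score per second
token_advantage = 5.9 / 0.20   # roughly 30x more score per 1k tokens
```

Read this way, the caption's numbers correspond to about an order of magnitude more score per second and roughly thirty times more score per thousand tokens.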
