Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad, Xinjie Lei, Liam Robbins, Karthik Padthe, Hu Xu, Xian Li, Asli Celikyilmaz, Ramya Raghavendra, Lifei Huang, Carole-Jean Wu, Shang-Wen Li
TL;DR
Matrix tackles the scalability bottleneck of multi-agent synthetic data generation by replacing centralized orchestration with a decentralized, peer-to-peer, message-driven runtime. Control and state are embedded in serialized orchestrator messages that pass between stateless agent actors, while heavy compute is offloaded to distributed services on Ray, enabling fine-grained, row-level scheduling across tens of thousands of concurrent workflows. Across Coral, NaturalReasoning, and Tau2-bench, Matrix achieves $2$--$15\times$ higher token throughput than baselines, with output quality on par. With modular components, Hydra-based configuration, and integration with open-source tools (Ray, SLURM, vLLM, SGLang, Apptainer), Matrix offers a practical, scalable framework for synthetic data generation and agentic experimentation.
Abstract
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves $2$--$15\times$ higher data generation throughput under identical hardware resources, without compromising output quality.
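To make the peer-to-peer design concrete, the following is a minimal sketch of the message-passing idea, not Matrix's actual implementation or API. It uses plain `asyncio` queues in place of Ray actors and distributed queues. The `Orchestrator` fields, the `ROUTES` table, the agent names `writer` and `critic`, and the fixed turn budget are all hypothetical and chosen only for illustration. Each task's control and data state travels inside a serialized message, and each stateless agent forwards the message directly to the next peer, so no central scheduler is involved.

```python
import asyncio
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Orchestrator:
    """Hypothetical per-task message carrying data (history) and control (next_agent)."""
    task_id: int
    history: list = field(default_factory=list)
    next_agent: str = "writer"
    turns_left: int = 4

    def dumps(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def loads(raw: str) -> "Orchestrator":
        return Orchestrator(**json.loads(raw))


# Hypothetical routing table: which peer receives the message after each agent.
ROUTES = {"writer": "critic", "critic": "writer"}


async def agent(name, inboxes, done):
    """Stateless agent: all task state arrives in, and leaves with, the message."""
    while True:
        msg = Orchestrator.loads(await inboxes[name].get())
        # Stand-in for a call to a distributed service such as LLM inference.
        msg.history.append(f"{name}:turn{len(msg.history)}")
        msg.turns_left -= 1
        if msg.turns_left == 0:
            # Task finished: hand the final message to the output queue.
            await done.put(msg.dumps())
        else:
            # Forward directly to the next peer; no central orchestrator.
            msg.next_agent = ROUTES[name]
            await inboxes[msg.next_agent].put(msg.dumps())


async def run(num_tasks: int):
    inboxes = {name: asyncio.Queue() for name in ROUTES}
    done = asyncio.Queue()
    workers = [asyncio.create_task(agent(n, inboxes, done)) for n in ROUTES]
    # Each task (one "row") starts as its own message and advances independently.
    for task_id in range(num_tasks):
        first = Orchestrator(task_id=task_id)
        await inboxes[first.next_agent].put(first.dumps())
    results = [Orchestrator.loads(await done.get()) for _ in range(num_tasks)]
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return sorted(results, key=lambda m: m.task_id)


if __name__ == "__main__":
    for result in asyncio.run(run(3)):
        print(result.task_id, result.history)
```

Because the agents hold no per-task state, any number of tasks can be in flight at once, and each one advances as soon as its own message is processed rather than waiting on a batch. In a real deployment the in-process queues would presumably be replaced by distributed ones and the placeholder step by calls to inference or container services.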
