Leveraging Large Language Models for Automated Reproduction of Networking Research Results
Yining Jiang, Yunxin Xu, Wenyun Xu, Yufan Zhu, Tangtang He, Haiying Huang, Letian Zhu, Qingyu Song, Qiang Su, Lizhao You, Lu Tang, Wanjin Feng, Yuchao Zhang, Linghe Kong, Qiao Xiang, Jiwu Shu
TL;DR
This work tackles the reproducibility crisis in networking research by introducing RepLLM, a memory-backed, multi-agent system that converts academic papers into executable networking code. By decomposing the task into Content Parsing, Architecture Design, Code Generation, and Audit & Repair, RepLLM achieves robust paper-to-code synthesis through explicit context sharing and a sandboxed, iterative refinement loop. Empirical results across multiple conferences show that RepLLM outperforms baselines in code reliability and semantic alignment, enabling high-fidelity reproduction with limited human intervention. The framework significantly lowers reproduction costs and provides a scalable path toward transparent, reproducible networking research.
Abstract
Code reproduction is a cornerstone of scientific validity, yet it remains a formidable challenge in computer networking research due to the scarcity of open-source implementations and the complexity of heterogeneous system architectures. While Large Language Models (LLMs) have demonstrated potential in code generation, existing frameworks often fail to address the long-context constraints and intricate logical dependencies involved in reproducing network systems from academic papers. To facilitate result reproduction, we introduce RepLLM, an end-to-end multi-agent framework designed to automate the transformation of networking research papers into executable code. RepLLM features a novel collaborative architecture comprising four specialized agents (Content Parsing, Architecture Design, Code Generation, and Audit & Repair) coordinated through an explicit Shared Memory mechanism to ensure global context consistency. By combining Chain-of-Thought LLM reasoning with a sandbox-isolated static-dynamic debugging methodology, our framework effectively resolves semantic discrepancies and runtime errors. Extensive evaluations on representative papers from SIGCOMM and NSDI demonstrate that RepLLM significantly outperforms state-of-the-art baselines in generating compile-ready and logically correct systems. Results further demonstrate that RepLLM facilitates the reproduction of 80% of the original benchmarks with only four hours of human intervention.
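The four-agent flow described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not RepLLM's implementation: every name here (SharedMemory, parse_content, design_architecture, generate_code, audit, repair, run_pipeline) is hypothetical, the agents are deterministic stubs rather than LLM calls, and the "sandbox" is only a separate exec namespace, which provides no real isolation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SharedMemory:
    """Hypothetical shared store that every agent reads from and writes to."""
    entries: Dict[str, object] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def write(self, agent: str, key: str, value: object) -> None:
        self.entries[key] = value
        self.log.append(f"{agent} wrote {key}")

    def read(self, key: str) -> object:
        return self.entries[key]


def parse_content(mem: SharedMemory, paper_text: str) -> None:
    # Stub for Content Parsing: treat each non-empty line as one requirement.
    reqs = [line.strip() for line in paper_text.splitlines() if line.strip()]
    mem.write("content_parsing", "requirements", reqs)


def design_architecture(mem: SharedMemory) -> None:
    # Stub for Architecture Design: one module name per requirement.
    reqs = mem.read("requirements")
    mem.write("architecture_design", "modules",
              [f"module_{i}" for i in range(len(reqs))])


def generate_code(mem: SharedMemory) -> None:
    # Stub for Code Generation: the first draft deliberately contains a
    # runtime bug so that the repair loop has something to fix.
    draft = "def run():\n    return undefined_name\n"
    mem.write("code_generation", "code",
              {name: draft for name in mem.read("modules")})


def audit(source: str) -> List[str]:
    """Static check (compile), then dynamic check (exec and call run())."""
    try:
        compiled = compile(source, "<generated>", "exec")
    except SyntaxError as exc:
        return [f"static: {exc.msg}"]
    namespace: Dict[str, object] = {}  # separate namespace, NOT a real sandbox
    try:
        exec(compiled, namespace)
        namespace["run"]()
    except Exception as exc:
        return [f"dynamic: {type(exc).__name__}: {exc}"]
    return []


def repair(source: str, errors: List[str]) -> str:
    # Stub for an LLM repair step: patch the one known placeholder bug.
    return source.replace("undefined_name", "0")


def audit_and_repair(mem: SharedMemory, max_rounds: int = 3) -> None:
    code = dict(mem.read("code"))
    rounds = 0
    for _ in range(max_rounds):
        failing = {n: errs for n, s in code.items() if (errs := audit(s))}
        if not failing:
            break
        for name, errs in failing.items():
            code[name] = repair(code[name], errs)
        rounds += 1
    mem.write("audit_repair", "code", code)
    mem.write("audit_repair", "repair_rounds", rounds)


def run_pipeline(paper_text: str) -> SharedMemory:
    mem = SharedMemory()
    parse_content(mem, paper_text)
    design_architecture(mem)
    generate_code(mem)
    audit_and_repair(mem)
    return mem
```

Calling run_pipeline on a short text and inspecting mem.log shows which stage wrote which entry. Routing every hand-off through one store, rather than passing outputs directly between stages, is the property the abstract attributes to Shared Memory: each agent sees the same global context.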
