SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

Zhirui Zhang, Hongbo Zhang, Haoxiang Fei, Zhiyuan Bao, Yubin Chen, Zhengyu Lei, Ziyue Liu, Yixuan Sun, Mingkun Xiao, Zihang Ye, Yu Zhang, Hongcheng Zhu, Yuxiang Wen, Heung-Yeung Shum

TL;DR

SWE-AGI tackles the problem of autonomous, specification-driven software construction by introducing an open benchmark in MoonBit that requires end-to-end production-grade implementations from authoritative specifications. It uses a fixed API scaffold, spec-first declarations, and private evaluation tests to force long-horizon architectural reasoning and to minimize data leakage from code retrieval. Across 22 tasks, frontier models solve all easy tasks but show substantial degradation on medium and hard tasks, with performance closely tied to code reading and system coherence rather than raw code generation. The findings indicate that autonomous software engineering from explicit specifications is increasingly viable yet currently imperfect, highlighting bottlenecks in reading and integration and pointing to future work on distributed systems, library-centric development, and richer multi-modal toolchains.

Abstract

Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.

Paper Structure (21 sections, 3 figures, 9 tables)

Figures (3)

  • Figure 1: SWE-AGI benchmark execution pipeline. From a cold-start starter repository (inputs: TASK.md, normative specs/, a MoonBit scaffold, and public tests), an autonomous agent iterates over design/implementation and local testing, submits the project for evaluation (via swe-agi-submit), receives pass/fail feedback, and repeats until a verified submission passes.
  • Figure 2: Conceptual contrast between SWE-bench (Jimenez et al., 2023) and SWE-AGI evaluation settings.
  • Figure 3: Declaration-first, spec-driven workflow in MoonBit. The declare keyword fixes public types and function signatures (e.g., parser entry points and test-schema encoders) before implementation.
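
The iterate, test, submit, and repeat loop described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch only, not the benchmark's actual harness: `Feedback`, `run_agent_loop`, and the three callables (`implement`, `run_public_tests`, `submit`) are hypothetical names introduced here, and `submit` merely stands in for whatever the real `swe-agi-submit` command does. The round budget is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Feedback:
    """Pass/fail result returned by the (hypothetical) evaluation step."""
    passed: bool
    detail: str = ""


def run_agent_loop(
    implement: Callable[[int, str], str],
    run_public_tests: Callable[[str], bool],
    submit: Callable[[str], Feedback],
    max_rounds: int = 10,  # assumed budget; the source does not state one
) -> Tuple[bool, int]:
    """Repeat implement -> local test -> submit until a submission passes.

    Returns (solved, rounds_used).
    """
    hint = ""
    for round_no in range(1, max_rounds + 1):
        # Design/implementation step, informed by the previous feedback.
        project = implement(round_no, hint)
        # Local testing against the public tests comes before submission.
        if not run_public_tests(project):
            hint = "public tests failed"
            continue
        # Submission is evaluated against private tests; only pass/fail
        # feedback comes back.
        feedback = submit(project)
        if feedback.passed:
            return True, round_no
        hint = feedback.detail
    return False, max_rounds


if __name__ == "__main__":
    # Toy stand-ins: the "project" is just a version string, the public
    # tests reject v1, and the private tests accept only v3.
    solved, rounds = run_agent_loop(
        implement=lambda n, hint: f"v{n}",
        run_public_tests=lambda p: p != "v1",
        submit=lambda p: Feedback(passed=(p == "v3"), detail="hidden test failed"),
    )
    print(solved, rounds)  # prints "True 3"
```

In the toy run, round 1 is stopped by the public tests, round 2 is rejected at submission, and round 3 passes. Separating the local check from the submission step mirrors the split between public tests in the starter repository and the private evaluation tests mentioned in the TL;DR.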