Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
Chenyu Zhao, Shenglin Zhang, Zeshun Huang, Weilin Jin, Yongqian Sun, Dan Pei, Chaoyun Zhang, Qingwei Lin, Chetan Bansal, Saravan Rajmohan, Minghua Ma
TL;DR
This paper presents Build-bench, the first executable, architecture-aware benchmark for evaluating whether LLMs can diagnose, repair, and verify software build failures during cross-ISA migration. By coupling real build environments (OBS) with an iterative, tool-augmented repair loop mediated by MCP, the study assesses six LLMs on 268 real-world failures covering migration from x86_64 to aarch64 and back. Iterative feedback substantially improves repair success, while long build logs and multi-file dependencies remain key bottlenecks. GPT-5 leads on forward repairs, peaking at a 63.19% build success rate. The results also show that tool orchestration and fine-grained prompts influence outcomes, though overall robustness across architectures remains limited. The work delivers a practical framework and quantitative baselines to guide future research on LLM-assisted cross-architecture software maintenance and automated build repair.
Abstract
Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools, including Structure Extraction, File Content Extraction, Content Modification, and Build Verification, to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop: upon failure, the model receives the updated build log and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and that tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
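The iterative repair loop described in the abstract can be illustrated with a minimal sketch. This is not the Build-bench implementation: the names `Attempt`, `repair_loop`, `propose_patch`, and `run_build` are hypothetical, the model and the build service are replaced by stub callables, and truncating the log to its tail is only an assumed way of coping with long build logs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One repair attempt and its verified outcome (hypothetical record type)."""
    iteration: int
    patch: str
    success: bool
    log_tail: str


def repair_loop(
    propose_patch: Callable[[str, List[Attempt]], str],
    run_build: Callable[[str], Tuple[bool, str]],
    initial_log: str,
    max_iters: int = 3,
    log_tail_chars: int = 2000,
) -> List[Attempt]:
    """Propose a patch, verify it with a build, and feed the result back.

    propose_patch stands in for the LLM; run_build stands in for build
    verification. Both are assumptions made for illustration.
    """
    history: List[Attempt] = []
    log = initial_log
    for iteration in range(1, max_iters + 1):
        # The model sees the latest log (truncated) and all earlier outcomes.
        patch = propose_patch(log[-log_tail_chars:], history)
        success, log = run_build(patch)
        history.append(Attempt(iteration, patch, success, log[-log_tail_chars:]))
        if success:
            break  # a verified build ends the loop
    return history


def fake_model(log: str, history: List[Attempt]) -> str:
    """Stub model: numbers its patches by attempt."""
    return f"patch-{len(history) + 1}"


def fake_build(patch: str) -> Tuple[bool, str]:
    """Stub build: only the second patch succeeds."""
    ok = patch == "patch-2"
    return ok, ("build ok" if ok else "error: build failed on target architecture")
```

Calling `repair_loop(fake_model, fake_build, "error: initial failure")` with these stubs returns two attempts, the second of which is marked successful. Keeping the full attempt history, rather than only the latest log, mirrors the abstract's statement that the model receives previous repair outcomes alongside the updated build log.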
