Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

Chenyu Zhao, Shenglin Zhang, Zeshun Huang, Weilin Jin, Yongqian Sun, Dan Pei, Chaoyun Zhang, Qingwei Lin, Chetan Bansal, Saravan Rajmohan, Minghua Ma

TL;DR

This paper presents Build-bench, the first executable, architecture-aware benchmark for evaluating whether LLMs can diagnose, repair, and verify software build failures during cross-ISA migration. By coupling real build environments (OBS) with an iterative, tool-augmented repair loop mediated by MCP, the study assesses six LLMs on 268 real-world build failures across both migration directions (x86_64 to aarch64 and aarch64 to x86_64). It finds that iterative feedback substantially improves repair success, while long build logs and multi-file dependencies remain key bottlenecks. GPT-5 leads in forward repairs with a peak build success rate of 63.19%, and the results show that tool orchestration and fine-grained prompts influence outcomes, though overall robustness across architectures is still limited. The work delivers a practical framework and quantitative baselines to guide future research on LLM-assisted cross-architecture software maintenance and automated build repair.

Abstract

Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools, including Structure Extraction, File Content Extraction, Content Modification, and Build Verification, to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop: upon failure, the model receives the updated build log and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and that tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.

Paper Structure

This paper contains 33 sections, 1 equation, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Comparison of different large language models (LLMs) in cross-ISA build repair tasks. (a) shows the success rates (%) achieved on four migration scenarios (x86_64$\rightarrow$aarch64 (F), x86_64$\rightarrow$aarch64 (P), aarch64$\rightarrow$x86_64 (F), and aarch64$\rightarrow$x86_64 (P)), where F denotes Full File Generation and P denotes Patch Generation. (b) summarizes the overall success rates across all tasks for each model.
  • Figure 2: The automatic cross-ISA repair and build pipeline of Build-bench. If the build fails and the maximum number of iterations ($N_{\max}=3$) has not been reached, the process repeats with the updated build log and the previous repair content.
  • Figure 3: Comparison of tool invocation behavior across LLMs. The bars represent the total number of invocations for each tool per LLM, while the gray line indicates the average number of tool calls per iteration.
  • Figure 4: Comparison of two repair strategies (Full File Generation vs. Patch Generation) across six LLMs and two architecture migration directions. The upper row reports the Build Success Rate, while the lower row presents Efficiency in terms of Average Repair Time (min) and Average Token Consumption (K).
  • Figure 5: Iterative repair process of the texmath package migration.
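
The iterative loop summarized in the abstract and in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Build-bench implementation: the names `RepairState`, `propose_repair`, `run_build`, and `iterative_repair` are hypothetical, the model and the build service are replaced by toy stand-in functions, and counting the limit as at most $N_{\max}=3$ repair attempts is one reading of the caption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

N_MAX = 3  # iteration limit taken from the Figure 2 caption


@dataclass
class RepairState:
    """What the model sees at each iteration (hypothetical structure)."""
    build_log: str
    previous_repairs: List[str] = field(default_factory=list)


def iterative_repair(
    initial_log: str,
    propose_repair: Callable[[RepairState], str],
    run_build: Callable[[str], Tuple[bool, str]],
    n_max: int = N_MAX,
) -> Tuple[bool, int]:
    """Return (build_succeeded, attempts_used)."""
    state = RepairState(build_log=initial_log)
    for attempt in range(1, n_max + 1):
        repair = propose_repair(state)          # model proposes a fix
        succeeded, new_log = run_build(repair)  # build verification step
        if succeeded:
            return True, attempt
        # On failure, feed back the updated log and the failed repair.
        state.build_log = new_log
        state.previous_repairs.append(repair)
    return False, n_max


# Toy stand-ins (assumptions) so the sketch runs without a model or build farm.
def toy_model(state: RepairState) -> str:
    return f"attempt-{len(state.previous_repairs) + 1}"


def toy_build(repair: str) -> Tuple[bool, str]:
    ok = repair == "attempt-2"
    return (ok, "build ok" if ok else f"error after {repair}")


def always_failing_build(repair: str) -> Tuple[bool, str]:
    return (False, f"error after {repair}")


result = iterative_repair("initial failure log", toy_model, toy_build)
exhausted = iterative_repair("initial failure log", toy_model, always_failing_build)
```

With these stand-ins, the second attempt succeeds in the first call, while the second call stops once the iteration limit is exhausted. Passing the model and build steps in as callables keeps the loop itself independent of any particular LLM or build backend.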