Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools

Ha Min Son, Huan Ren, Xin Liu, Zhe Zhao

TL;DR

This work introduces AndroidBuildBench, a real-world benchmark of 1,019 reproducible Android build failures from 43 projects, each paired with a verified fix so that every problem is known to be solvable. It then presents GradleFixer, an LLM agent equipped with domain-specific tools that wrap Gradle-related commands, and shows that Tool Bridging (replacing a general-purpose shell with API-like abstractions) significantly improves repair success (pass@1 around 81–84%) over baselines. Key findings: domain-aware tooling yields higher repair rates, smaller models can be cost-effective when given specialized tools, and problems that require large changes remain a major challenge. The results have practical implications for automating Android builds and suggest that domain-specific tooling could generalize to other software domains, enabling faster, cheaper, and more reliable automated repair.

Abstract

Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer's success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model's high-level reasoning and effective low-level execution.
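The contrast behind Tool Bridging can be illustrated with a minimal sketch. It is not GradleFixer's implementation: the tool names (`run_gradle_task`, `list_dependencies`), the `Tool` class, and the `dispatch` helper are hypothetical, chosen only for illustration. The only parts taken from standard Gradle are the `./gradlew` wrapper, the `--stacktrace` flag, and the built-in `dependencies` task. The sketch shows both hypothesized mechanisms: each tool has an API-like signature with named arguments, and the agent can only reach the operations listed in the registry rather than an arbitrary shell.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """A hypothetical domain-specific tool exposed to the agent."""
    name: str
    description: str
    build_argv: Callable[..., List[str]]


def _run_gradle_task(task: str) -> List[str]:
    # Accept only plain task paths such as "assembleDebug" or ":app:assembleDebug",
    # so free-form shell text can never be passed through.
    if not task.replace(":", "").replace("_", "").isalnum():
        raise ValueError(f"not a valid Gradle task path: {task!r}")
    return ["./gradlew", task, "--stacktrace"]


def _list_dependencies(module: str) -> List[str]:
    # Accept only simple module names such as "app" or "core-ui".
    if not module.replace("-", "").replace("_", "").isalnum():
        raise ValueError(f"not a valid module name: {module!r}")
    return ["./gradlew", f":{module}:dependencies"]


# The registry is the constrained action space: nothing outside it is reachable.
TOOLS: Dict[str, Tool] = {
    "run_gradle_task": Tool(
        "run_gradle_task", "Run one Gradle task and report errors.", _run_gradle_task
    ),
    "list_dependencies": Tool(
        "list_dependencies", "Show the dependency tree of one module.", _list_dependencies
    ),
}


def dispatch(tool_name: str, **kwargs: str) -> List[str]:
    """Translate a structured tool call from the model into an argv list."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name!r}")
    return TOOLS[tool_name].build_argv(**kwargs)
```

A caller would pass the returned list to something like `subprocess.run(argv, cwd=project_dir)`. Because the command is built as an argument list and never as a shell string, a malformed model output fails validation instead of executing.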

Paper Structure

This paper contains 34 sections, 3 figures, 11 tables.

Figures (3)

  • Figure 1: The pass@k resolve rates (percentage of problems solved within k independent sampling attempts) for different agent frameworks on our test set of 184 build errors. We find that replacing the general shell with domain-specific tools significantly improves performance.
  • Figure 2: The pass@k resolve rates compared across model sizes. Our method with a smaller model outperforms Gemini-CLI with a larger model, which supports the importance of domain-specific tools for performance.
  • Figure 3: Distribution of problems by the number of lines changed, plotted on a logarithmic x-axis. The histogram shows that while most problems involve small changes, the benchmark contains a long tail of complex problems whose fixes changed hundreds or thousands of lines.
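
The pass@k metric reported in Figures 1 and 2 can be computed in more than one way. The sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n attempts are sampled per problem and c of them succeed. Whether the paper uses this exact estimator is an assumption, and the function names `pass_at_k` and `resolve_rate` are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n attempts of which c succeeded, is a success."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def resolve_rate(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, as a percentage.

    `results` holds one (n, c) pair per problem.
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("no problems given")
    return 100.0 * sum(scores) / len(scores)
```

For example, a problem solved in 2 of 4 attempts contributes 0.5 at k = 1 and 1 - 1/6 at k = 2, which is why pass@k curves rise with k.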