MobileDev-Bench: A Comprehensive Benchmark for Evaluating Language Models on Mobile Application Development

Moshood A. Fakorede, Krishna Upadhyay, A. B. Siddique, Umar Farooq

Abstract

Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on general-purpose libraries or web applications, leaving mobile application development largely unexplored despite its strict platform constraints, framework-driven lifecycles, and complex platform API interactions. We introduce MobileDev-Bench, a benchmark comprising 384 real-world issue-resolution tasks collected from 18 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs an authentic developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantial patch complexity: fixes modify an average of 12.5 files and 324.9 lines, and 35.7% of instances require coordinated changes across multiple artifact types, such as source and manifest files. Evaluation of four state-of-the-art code-capable LLMs (GPT-5.2, Claude Sonnet 4.5, Gemini Flash 2.5, and Qwen3-Coder) yields low end-to-end resolution rates of 3.39%-5.21%, revealing significant performance gaps compared to prior benchmarks. Further analysis reveals systematic failure modes, with fault localization across multi-file and multi-artifact changes emerging as the primary bottleneck.
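For illustration only (not the paper's released evaluation harness), the following minimal Python sketch shows how an end-to-end resolution rate over the 384 task instances might be computed; the `TaskInstance` fields and the result dictionary are hypothetical stand-ins for the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    """One hypothetical MobileDev-Bench task: an issue paired with a test patch."""
    instance_id: str
    framework: str          # e.g. "android-native", "react-native", "flutter"
    problem_statement: str  # developer-reported issue text
    test_patch: str         # executable tests that must pass after the fix


def resolution_rate(results: dict[str, bool]) -> float:
    """Fraction of instances whose model-generated patch passes its test patch."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


# Example: 13 of 384 instances resolved -> ~3.39%, matching the lower end
# of the range reported in the abstract.
print(f"{resolution_rate({f'task-{i}': i < 13 for i in range(384)}):.2%}")
```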

Paper Structure

This paper contains 51 sections, 3 equations, 11 figures, and 14 tables.

Figures (11)

  • Figure 1: Overview of MobileDev-Bench construction pipeline.
  • Figure 2: Distribution of difficulty tiers and task categories in MobileDev-Bench. Difficulty tiers are approximately balanced, while bug fixes constitute the largest share of tasks.
  • Figure 3: Distribution of patch complexity metrics (files modified, hunks, changed lines) across five benchmarks (SWE-PB: SWE-PolyBench; SWE-MM: SWE-bench Multimodal; Multi-SWE: Multi-SWE-bench; Rust-SWE: Rust-SWE-bench; Ours: MobileDev-Bench). Each data point is the per-repository average. Y-axes are capped at the 90th percentile for visual clarity.
  • Figure 4: Recall decreases sharply as the number of modified files increases.
  • Figure 5: Distribution of manual annotation scores for (a) problem-statement clarity and (b) unit-test coverage adequacy, across all 560 execution-validated candidate instances. Green bars (scores 0–1) indicate instances accepted into the benchmark; hatched red bars (scores 2–3) indicate excluded instances. The majority of exclusions arise from vague issue descriptions (score 2/3 on dimension a) or tests that impose undue implementation constraints (score 2/3 on dimension b).
  • ...and 6 more figures