SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?

Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, Jiaxuan You

TL;DR

SWE-Bench Mobile tackles the question of whether large language model agents can develop industry-grade mobile applications by introducing a hosted, multi-modal benchmark built from real production artifacts. The benchmark combines PRDs, Figma designs, and a large Swift/Objective-C codebase across 50 tasks with 449 test cases, evaluated via patch-based tests to reflect real-world workflows. Across 22 agent–model configurations and four coding agents, the best task success is only $12\%$, with test pass rates up to $28.1\%$, revealing a substantial gap between current capabilities and industrial requirements; results also show that agent design, commercial tooling, and defensive prompting significantly influence performance, with a $6\times$ gap observed across agents using the same model. The work provides actionable guidance for practitioners and researchers, highlights the critical need for improved multi-modal grounding and cross-file reasoning, and releases a hosted evaluation platform with a public leaderboard to accelerate progress toward industry-level autonomous development.

Abstract

Can large language model agents develop industry-level mobile applications? We introduce \textbf{SWE-Bench Mobile}, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development: multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites. We evaluate 22 agent-model configurations across four coding agents -- three commercial (Cursor, Codex, Claude Code) and one open-source (OpenCode) -- and find that even the best configurations achieve only a 12\% task success rate. Our analysis reveals that (1) agent design matters as much as model capability -- the same model shows up to a 6$\times$ performance gap across agents, (2) commercial agents consistently outperform open-source alternatives, and (3) simple ``Defensive Programming'' prompts outperform complex ones by 7.4\%. These findings highlight a significant gap between current agent capabilities and industrial requirements, while providing actionable insights for practitioners and researchers. We release SWE-Bench Mobile as a \textit{hosted benchmark challenge} to prevent data contamination and ensure fair evaluation. The public leaderboard and development toolkit are available at https://swebenchmobile.com.
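
The summary reports two different metrics: a task success rate (at best 12%, which for 50 tasks corresponds to 6 tasks) and a test pass rate (up to 28.1%). The sketch below shows how two such metrics can diverge. It assumes that a task counts as solved only when every one of its test cases passes, and that the test pass rate is pooled over all test cases; neither definition is spelled out in this summary. The `TaskResult` type, the task names, and the counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Per-task test outcome (hypothetical structure, not the benchmark's format)."""
    task_id: str
    passed: int  # number of test cases that passed
    total: int   # number of test cases for the task


def task_success_rate(results):
    """Fraction of tasks with every test passing (assumed definition)."""
    solved = sum(1 for r in results if r.total > 0 and r.passed == r.total)
    return solved / len(results)


def overall_test_pass_rate(results):
    """Fraction of all test cases that passed, pooled across tasks (assumed definition)."""
    passed = sum(r.passed for r in results)
    total = sum(r.total for r in results)
    return passed / total


# Made-up outcomes for four illustrative tasks.
results = [
    TaskResult("task_a", passed=10, total=10),  # fully solved
    TaskResult("task_b", passed=6, total=8),    # partially solved
    TaskResult("task_c", passed=0, total=9),    # nothing passes
    TaskResult("task_d", passed=3, total=13),   # partially solved
]

print(f"task success rate: {task_success_rate(results):.3f}")
print(f"test pass rate:    {overall_test_pass_rate(results):.3f}")
```

Under these assumptions, partially correct patches raise the test pass rate without contributing to task success, which is one way the second number can sit well above the first.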

Paper Structure (41 sections, 1 equation, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Overview of the SWE-Bench Mobile pipeline. (1) Agents receive multi-modal inputs including a Product Requirement Document (PRD), Figma design, and a large-scale Swift/Objective-C codebase. (2) The agent navigates the codebase, plans the implementation, and generates code. (3) The output is a Git patch that is applied and evaluated against a comprehensive test suite.
  • Figure 2: A concrete example of a SWE-Bench Mobile task (Task 056). The agent must interpret the PRD requirements (replace interaction button with publish time label) and visual design (Figma), locate the relevant files in the codebase (FeedItemFooter.swift), and implement the changes while handling edge cases and feature configuration.
  • Figure 3: Task distribution by category (left) and difficulty (right). Each label shows the count, percentage, and average agent pass rate. UI Components (36%) dominate the benchmark, while performance drops sharply from Easy (18.5% pass) to Hard (5.8% pass).
  • Figure 4: Task Success Rate across all configurations. Best performance is 12%, achieved by Cursor + Opus/Sonnet and Codex + GLM.
  • Figure 5: Performance decreases sharply with task complexity. (a) Tasks requiring 1-2 file modifications have 18% success rate vs. 2% for 7+ files. (b) Small patches ($<$50 lines) achieve 20% success vs. 3% for large patches ($>$200 lines). Error bars show 95% confidence intervals based on binomial proportions.
  • ...and 6 more figures
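
The Figure 5 caption mentions 95% confidence intervals based on binomial proportions but does not say which interval method is used. As a minimal sketch, the Wilson score interval is one common choice for success rates computed from small samples; using it here is an assumption, and the counts in the usage lines are hypothetical rather than taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    This is an illustrative choice; the paper's exact method is not stated here.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 9 successful attempts out of 50.
low, high = wilson_interval(9, 50)
print(f"observed rate {9 / 50:.2f}, interval [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the observed rate is close to zero, which matters for buckets with very few successes.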