SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?
Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, Jiaxuan You
TL;DR
SWE-Bench Mobile asks whether large language model agents can develop industry-grade mobile applications, and introduces a hosted, multi-modal benchmark built from real production artifacts to answer it. The benchmark combines PRDs, Figma designs, and a large Swift/Objective-C codebase across 50 tasks with 449 test cases, evaluated via patch-based tests to reflect real-world workflows. Across 22 agent-model configurations spanning four coding agents, the best task success rate is only $12\%$ and test pass rates reach at most $28.1\%$, revealing a substantial gap between current capabilities and industrial requirements. Agent design, the choice between commercial and open-source agents, and prompt design all affect performance: the same model shows up to a $6\times$ gap across agents, and simple "Defensive Programming" prompts outperform complex ones. The work provides actionable guidance for practitioners and researchers, highlights the need for improved multi-modal grounding and cross-file reasoning, and releases a hosted evaluation platform with a public leaderboard to accelerate progress toward industry-level autonomous development.
Abstract
Can large language model agents develop industry-level mobile applications? We introduce \textbf{SWE-Bench Mobile}, a benchmark for evaluating coding agents on realistic software engineering tasks derived from a production iOS codebase. Unlike existing benchmarks that focus on isolated problems or bug fixes, SWE-Bench Mobile captures the full complexity of industrial development: multi-modal inputs (PRDs and Figma designs), a large-scale mixed Swift/Objective-C codebase, and comprehensive test suites. We evaluate 22 agent-model configurations across four coding agents -- three commercial (Cursor, Codex, Claude Code) and one open-source (OpenCode) -- and find that even the best configurations achieve only a 12\% task success rate. Our analysis reveals that (1) agent design matters as much as model capability -- the same model shows up to a 6$\times$ performance gap across agents, (2) commercial agents consistently outperform open-source alternatives, and (3) simple ``Defensive Programming'' prompts outperform complex ones by 7.4\%. These findings highlight a significant gap between current agent capabilities and industrial requirements, while providing actionable insights for practitioners and researchers. We release SWE-Bench Mobile as a \textit{hosted benchmark challenge} to prevent data contamination and ensure fair evaluation. The public leaderboard and development toolkit are available at https://swebenchmobile.com.
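The two headline metrics, task success rate and test pass rate, can diverge considerably. The following minimal sketch illustrates why. It assumes, which this text does not state, that a task counts as successful only when all of its test cases pass, and that the test pass rate pools test cases across all tasks. The names `TaskResult`, `task_success_rate`, and `overall_pass_rate` are hypothetical, and the per-task numbers are invented for illustration, not benchmark data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical structure, illustrative only)."""
    task_id: str
    passed: int  # number of test cases that passed
    total: int   # number of test cases for this task


def task_success_rate(results):
    # Assumption: a task succeeds only if every one of its tests passes.
    return sum(r.passed == r.total for r in results) / len(results)


def overall_pass_rate(results):
    # Assumption: test cases are pooled across tasks, not averaged per task.
    return sum(r.passed for r in results) / sum(r.total for r in results)


# Invented example data: one fully solved task, three partial or failed ones.
results = [
    TaskResult("task-a", passed=10, total=10),
    TaskResult("task-b", passed=3, total=10),
    TaskResult("task-c", passed=0, total=5),
    TaskResult("task-d", passed=4, total=15),
]

print(f"task success rate: {task_success_rate(results):.3f}")
print(f"test pass rate:    {overall_pass_rate(results):.3f}")
```

Under these assumptions, partial progress on a task raises the test pass rate but not the task success rate, which would be consistent with the reported test pass rate exceeding the task success rate. If the rate is computed over all 50 tasks, a 12% task success rate corresponds to 6 fully solved tasks.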
