Scaling Mobile Chaos Testing with AI-Driven Test Execution

Juan Marcano; Ashish Samant; Kai Song; Lingchao Chen; Kaelan Mikowicz; Tim Smyth; Mengdie Zhang; Ali Zamani; Arturo Bravo Rovirosa; Sowjanya Puligadda; Srikanth Prodduturi; Mayank Bansal

TL;DR

The paper addresses the scalability gap in mobile resilience testing for large-scale distributed applications by integrating AI-driven mobile test execution (DragonCrawl) with service-level fault injection (uHavoc). The combination sidesteps the combinatorial explosion of user flows, cities, and failure types, because test cases no longer have to be written by hand for each combination. Since Q1 2024 the system has executed over 180,000 automated chaos tests across 47 critical flows while maintaining 99% test reliability. It identified 23 resilience risks, including twelve severe enough to block trip requests or food orders and two that caused application crashes detectable only through mobile chaos testing. Automated root cause analysis cuts debugging time from hours to minutes, attributing mobile failures to specific backend services with 88% precision@5, and the automated runs stand in for approximately 39,000 hours of manual testing effort. The work demonstrates that continuous mobile resilience validation is achievable at production scale, and it also outlines limitations related to infrastructure scope, tracing requirements, and organizational readiness.

Abstract

Mobile applications in large-scale distributed systems are susceptible to backend service failures, yet traditional chaos engineering approaches cannot scale mobile testing due to the combinatorial explosion of flows, locations, and failure scenarios that need validation. We present an automated mobile chaos testing system that integrates DragonCrawl, an LLM-based mobile testing platform, with uHavoc, a service-level fault injection system. The key insight is that adaptive AI-driven test execution can navigate mobile applications under degraded backend conditions, eliminating the need to manually write test cases for each combination of user flow, city, and failure type. Since Q1 2024, our system has executed over 180,000 automated chaos tests across 47 critical flows in Uber's Rider, Driver, and Eats applications, representing approximately 39,000 hours of manual testing effort that would be impractical at this scale. We identified 23 resilience risks, with 70% being architectural dependency violations where non-critical service failures degraded core user flows. Twelve issues were severe enough to prevent trip requests or food orders. Two caused application crashes detectable only through mobile chaos testing, not backend testing alone. Automated root cause analysis reduced debugging time from hours to minutes, achieving 88% precision@5 in attributing mobile failures to specific backend services. This paper presents the system design, evaluates its performance under fault injection (maintaining 99% test reliability), and reports operational experience demonstrating that continuous mobile resilience validation is achievable at production scale.
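To make the scale argument concrete, the following is a minimal sketch of how the flow, city, and failure-type dimensions multiply into a test matrix. Only the 47 flows, the 180,000 tests, and the 39,000 hours come from the abstract; the city count, the fault-type names, and both helper functions are hypothetical and chosen purely for illustration. They do not describe the DragonCrawl or uHavoc interfaces.

```python
from itertools import product


def chaos_test_matrix(flows, cities, fault_types):
    """Enumerate every (flow, city, fault type) combination to validate."""
    return list(product(flows, cities, fault_types))


def avg_minutes_per_test(total_hours, total_tests):
    """Average manual effort per test implied by the aggregate figures."""
    return total_hours * 60 / total_tests


# 47 critical flows is the figure reported in the abstract.
flows = [f"flow_{i}" for i in range(47)]
# Hypothetical: the abstract does not say how many cities are covered.
cities = [f"city_{i}" for i in range(10)]
# Hypothetical fault types, named for illustration only.
fault_types = ["latency", "error", "timeout"]

matrix = chaos_test_matrix(flows, cities, fault_types)
print(len(matrix))  # 47 * 10 * 3 = 1410 combinations

# 39,000 hours spread over 180,000 tests, both figures from the abstract.
print(avg_minutes_per_test(39_000, 180_000))  # 13.0 minutes per test
```

Even with these small assumed dimensions, the matrix already holds 1,410 combinations, and each added city or fault type multiplies the total. The abstract's aggregate figures imply roughly 13 minutes of manual effort per test on average.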
