Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control

Georgios Papoudakis; Thomas Coste; Jianye Hao; Jun Wang; Kun Shao

TL;DR

The paper addresses sample-efficient off-policy RL for mobile app control under sparse rewards and costly simulations by introducing Succeed or Learn Slowly (SoLS), an asymmetric update scheme that learns aggressively from positive-advantage samples while regularising negative-advantage updates, augmented by Successful Transition Replay (STR). SoLS operates on top of a two-phase pipeline (SFT on AndroidControl followed by RL fine-tuning) and uses a joint loss $\mathcal{L} = \mathcal{L}_{ac} + \lambda \mathcal{L}_{cr}$, where the advantage $A(s,a) = R - V^{\pi_\theta}(s)$ guides the updates and a PPO-like constraint limits the negative ones; STR further concentrates learning on successful transitions. On the AndroidWorld benchmark, SoLS-STR achieves 51.3% overall success, significantly outperforming GPT-4o-based prompting methods and other RL baselines, with inference times of around $0.9$ seconds and speedups of up to $60\times$ relative to multi-pass prompting pipelines. This demonstrates that small, fine-tuned language models can surpass larger prompting-based systems in real-world mobile app control tasks, offering a practical path toward efficient, resource-conscious AI agents in interactive environments.
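The asymmetric update described above can be illustrated with a small per-sample sketch. This is not the authors' implementation: the function names, the clipping width `clip_eps`, the squared-error critic term, and the default value of `lam` are all assumptions made for illustration. Only the advantage $A(s,a) = R - V^{\pi_\theta}(s)$, the joint form $\mathcal{L}_{ac} + \lambda \mathcal{L}_{cr}$, and the idea of a direct update for positive samples with a PPO-like constraint on negative ones come from the summary.

```python
import math


def sols_actor_loss(logp_new, logp_old, reward, value, clip_eps=0.2):
    """Per-sample actor loss to be minimised (illustrative sketch only).

    logp_new / logp_old: log-probability of the taken action under the
    current policy and under the behaviour policy that collected the sample.
    clip_eps is an assumed hyperparameter, not a value from the paper.
    """
    advantage = reward - value  # A(s, a) = R - V(s)
    if advantage > 0:
        # Positive sample: plain policy-gradient term, no regularisation.
        return -advantage * logp_new
    # Negative sample: PPO-style clipped surrogate (assumed form), which
    # stops pushing once the action probability has dropped far enough.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return -min(ratio * advantage, clipped * advantage)


def critic_loss(reward, value):
    """Squared-error critic term (an assumption; the summary does not give it)."""
    return (value - reward) ** 2


def joint_loss(logp_new, logp_old, reward, value, lam=0.5, clip_eps=0.2):
    """L = L_ac + lambda * L_cr for one sample; lam is a placeholder value."""
    actor = sols_actor_loss(logp_new, logp_old, reward, value, clip_eps)
    return actor + lam * critic_loss(reward, value)
```

With a negative advantage, once the probability ratio falls below `1 - clip_eps` the loss becomes constant in `logp_new`, so the sample contributes no further gradient. A positive-advantage sample is never clipped in this sketch, which is one way to read "succeed or learn slowly".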

Abstract

Reinforcement learning (RL) using foundation models for policy approximations in multi-turn tasks remains challenging. We identify two main limitations related to sparse reward settings and policy gradient updates, based on which we formulate a key insight: updates from positive samples with high returns typically do not require policy regularisation, whereas updates from negative samples, reflecting undesirable behaviour, can harm model performance. This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy RL algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach, applying direct policy updates for positive samples and conservative, regularised updates for negative ones to prevent model degradation. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions, further improving sample efficiency. We evaluate SoLS on the AndroidWorld benchmark, where it significantly outperforms existing methods (at least 17% relative increase), including prompt-engineering and RL approaches, while requiring substantially fewer computational resources than GPT-4o-based methods with 5-60x faster inference.
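As a rough illustration of the replay idea, the sketch below keeps only transitions from successful episodes and samples from them for later updates. The abstract does not specify how STR stores or samples data, so the class name, the fixed capacity, the whole-episode success flag, and the uniform sampling are all assumptions rather than the paper's design.

```python
import random
from collections import deque


class SuccessfulTransitionBuffer:
    """Hypothetical buffer that retains transitions from successful episodes only."""

    def __init__(self, capacity=1000, seed=0):
        # Oldest transitions are discarded once capacity is reached.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add_episode(self, transitions, success):
        """Store an episode's transitions only if the episode succeeded."""
        if success:
            self._buffer.extend(transitions)

    def sample(self, k):
        """Uniformly sample up to k stored transitions without replacement."""
        k = min(k, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)
```

In a training loop, a batch drawn from such a buffer could be mixed with freshly collected transitions so that rare successes under sparse rewards are revisited more than once.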
