Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control

Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao

TL;DR

The paper addresses sample-efficient off-policy RL for mobile app control under sparse rewards and costly simulations by introducing Succeed or Learn Slowly (SoLS), an asymmetric update scheme that learns aggressively from positive-advantage samples while regularising negative-advantage updates, augmented by Successful Transition Replay (STR). SoLS operates on top of a two-phase pipeline (SFT on AndroidControl followed by RL fine-tuning) and uses a joint loss $\mathcal{L} = \mathcal{L}_{ac} + \lambda \mathcal{L}_{cr}$, with the advantage $A(s,a) = R - V^{\pi_\theta}(s)$ guiding updates and a PPO-like constraint on negative updates; STR further concentrates learning on successful transitions. On the AndroidWorld benchmark, SoLS-STR achieves 51.3% overall success, significantly outperforming GPT-4o-based prompting methods and other RL baselines, while delivering inference times around $0.9$ seconds and up to $60\times$ speedups relative to multi-pass prompting pipelines. This demonstrates that small, fine-tuned language models can surpass larger prompting-based systems in real-world mobile app control tasks, offering a practical path toward efficient, resource-conscious AI agents in interactive environments.
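The asymmetric update described above can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes that $\mathcal{L}_{ac}$ is the actor term and $\mathcal{L}_{cr}$ a squared-error critic term, that positive-advantage samples use a plain log-likelihood term weighted by the advantage, and that negative-advantage samples use a standard PPO clipped surrogate. The function name `sols_style_loss` and the parameters `clip_eps` and `lam` are hypothetical.

```python
import numpy as np

def sols_style_loss(logp_new, logp_old, returns, values, clip_eps=0.2, lam=0.5):
    """Forward value of an asymmetric actor-critic loss (illustrative sketch only).

    Assumed form, not taken from the paper's equations:
      A > 0  : unregularised term  -A * log pi_new(a|s)
      A <= 0 : PPO-style clipped surrogate on the ratio pi_new / pi_old
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    returns = np.asarray(returns, dtype=float)
    values = np.asarray(values, dtype=float)

    adv = returns - values                      # A(s, a) = R - V(s)
    ratio = np.exp(logp_new - logp_old)         # importance ratio

    # Positive advantage: learn directly from the successful sample.
    pos_term = -adv * logp_new
    # Negative advantage: clipped surrogate limits how far the policy moves.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    neg_term = -np.minimum(ratio * adv, clipped * adv)

    actor = np.where(adv > 0, pos_term, neg_term).mean()
    critic = ((values - returns) ** 2).mean()   # assumed squared-error critic
    return actor + lam * critic, actor, critic

# Usage with two made-up samples: one success (R=1) and one failure (R=0).
total, actor, critic = sols_style_loss(
    logp_new=np.log([0.5, 0.5]),
    logp_old=np.log([0.5, 0.5]),
    returns=[1.0, 0.0],
    values=[0.5, 0.5],
)
```

In this sketch, once the probability ratio of a negative-advantage action has dropped below `1 - clip_eps`, the clipped term becomes constant, so that sample stops pushing the policy further. Positive-advantage samples have no such limit.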

Abstract

Reinforcement learning (RL) using foundation models for policy approximations in multi-turn tasks remains challenging. We identify two main limitations related to sparse reward settings and policy gradient updates, based on which we formulate a key insight: updates from positive samples with high returns typically do not require policy regularisation, whereas updates from negative samples, reflecting undesirable behaviour, can harm model performance. This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy RL algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach, applying direct policy updates for positive samples and conservative, regularised updates for negative ones to prevent model degradation. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions, further improving sample efficiency. We evaluate SoLS on the AndroidWorld benchmark, where it significantly outperforms existing methods (at least 17% relative increase), including prompt-engineering and RL approaches, while requiring substantially fewer computational resources than GPT-4o-based methods with 5-60x faster inference.
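As a rough illustration of the Successful Transition Replay idea, the sketch below keeps only transitions from successful episodes and lets a training loop sample from them. The abstract does not give the storage or sampling details, so the per-task bounded buffers, uniform sampling, and the names `SuccessfulTransitionReplay`, `add_episode`, and `sample` are all assumptions made for this example.

```python
import random
from collections import defaultdict, deque

class SuccessfulTransitionReplay:
    """Hypothetical buffer that stores transitions from successful episodes only."""

    def __init__(self, capacity_per_task=64, seed=0):
        # One bounded FIFO buffer per task (an assumed design choice).
        self._buffers = defaultdict(lambda: deque(maxlen=capacity_per_task))
        self._rng = random.Random(seed)

    def add_episode(self, task, transitions, success):
        # Failed episodes are discarded; only successes are replayed later.
        if success:
            self._buffers[task].extend(transitions)

    def sample(self, task, k):
        # Uniformly sample up to k stored successful transitions for a task.
        stored = list(self._buffers.get(task, ()))
        return self._rng.sample(stored, min(k, len(stored)))

    def __len__(self):
        return sum(len(b) for b in self._buffers.values())

# Usage with made-up task names and (state, action) pairs.
buf = SuccessfulTransitionReplay(capacity_per_task=2)
buf.add_episode("open_settings", [("s0", "a0"), ("s1", "a1"), ("s2", "a2")], success=True)
buf.add_episode("send_email", [("x", "y")], success=False)
batch = buf.sample("open_settings", 5)
```

Bounding each task's buffer keeps frequently solved tasks from dominating the replayed data. Whether the paper uses such a bound is not stated in the abstract.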

Paper Structure

This paper contains 35 sections, 9 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of the SoLS and STR methods.
  • Figure 2: Left: Bar plot presenting the average success rate and two standard errors of the mean for PPO, DigiRL and SoLS with and without STR. Right: Scatter plot illustrating the trade-off between success rate and inference time. The most desirable location is in the bottom-right, demonstrating strong success rate and low inference time, which SoLS occupies.
  • Figure 3: SoLS success rate comparison at the beginning and end of training, by task category.
  • Figure 4: Input to SoLS and other RL methods.
  • Figure 5: Pie charts comparing task difficulty distribution between the full AndroidWorld benchmark, and the task subset used in this work.
  • ...and 3 more figures