Digi-Q: Learning Q-Value Functions for Training Device-Control Agents

Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, Aviral Kumar

TL;DR

Digi-Q trains device-control agents for dynamic environments by learning a Q-function from offline data on top of fine-tuned Vision-Language Model (VLM) representations. It stabilizes off-policy value learning with a representation fine-tuning phase followed by TD-learning, then improves the policy with a Best-of-N reranking objective that imitates the best candidate action according to the learned Q-values. Empirically, Digi-Q substantially outperforms prompting-based baselines and prior offline methods on Android-in-the-Wild benchmarks, shows notable data efficiency, and in some cases reaches parity with on-policy RL. The approach emphasizes offline data reuse, computational efficiency, and policy improvement without environment interaction, and it opens avenues for extending Q-based policy learning to real-world GUI tasks.

Abstract

While a number of existing approaches for building foundation model agents rely on prompting or fine-tuning with human demonstrations, these are not sufficient in dynamic environments (e.g., mobile device control). On-policy reinforcement learning (RL) should address these limitations, but collecting actual rollouts in an environment is often undesirable in truly open-ended agentic problems such as mobile device control or interacting with humans, where each unit of interaction is associated with a cost. In such scenarios, a method for policy learning that can utilize off-policy experience by training an action-value function is much more effective. In this paper, we develop an approach, called Digi-Q, to train VLM-based action-value Q-functions which are then used to extract the agent policy. We study our approach in the mobile device control setting. Digi-Q trains the Q-function using offline temporal-difference (TD) learning, on top of frozen, intermediate-layer features of a VLM. Compared to fine-tuning the whole VLM, this approach saves compute and enhances scalability. To make the VLM features amenable to representing the Q-function, we employ an initial phase of fine-tuning to amplify coverage over the actionable information needed for the value function. Once trained, we use this Q-function via a Best-of-N policy extraction operator that imitates the best action out of multiple candidate actions from the current policy as ranked by the value function, enabling policy improvement without environment interaction. Digi-Q outperforms several prior methods on user-scale device control tasks in Android-in-the-Wild, attaining a 21.2% improvement over the prior best-performing method. In some cases, our Digi-Q approach already matches state-of-the-art RL methods that require interaction. The project is open-sourced at https://github.com/DigiRL-agent/digiq
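
As a rough illustration of the Best-of-N policy extraction described in the abstract, the sketch below samples N candidate actions from a policy, scores each with a Q-function, and keeps the top-ranked action as an imitation target. All names here (`best_of_n`, `build_imitation_targets`, `toy_policy`, `toy_q`) are hypothetical stand-ins and are not taken from the Digi-Q codebase; in a real setup the candidates would presumably come from the VLM policy and the scores from the learned Q-function. Any additional filtering the paper applies before imitation is not modeled here.

```python
import random
from typing import Callable, List, Tuple

State = int   # toy stand-in for a screenshot plus task description
Action = int  # toy stand-in for a device-control action


def best_of_n(
    state: State,
    sample_action: Callable[[State, random.Random], Action],
    q_value: Callable[[State, Action], float],
    n: int,
    rng: random.Random,
) -> Tuple[Action, float]:
    """Sample n candidate actions and return the one with the highest Q-value."""
    candidates = [sample_action(state, rng) for _ in range(n)]
    scores = [q_value(state, a) for a in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


def build_imitation_targets(
    states: List[State],
    sample_action: Callable[[State, random.Random], Action],
    q_value: Callable[[State, Action], float],
    n: int,
    seed: int = 0,
) -> List[Tuple[State, Action]]:
    """Pair each offline state with its best-of-n action for supervised imitation."""
    rng = random.Random(seed)
    return [(s, best_of_n(s, sample_action, q_value, n, rng)[0]) for s in states]


# Hypothetical toy policy and Q-function, for illustration only.
def toy_policy(state: State, rng: random.Random) -> Action:
    return rng.randrange(10)  # uniform over 10 discrete actions


def toy_q(state: State, action: Action) -> float:
    return -abs(action - state)  # highest when the action matches the state


targets = build_imitation_targets([0, 5, 9], toy_policy, toy_q, n=8)
```

The resulting `(state, action)` pairs would then serve as targets for an ordinary supervised fine-tuning step, and no environment interaction is needed because both the sampling and the scoring happen on logged states.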

Paper Structure

This paper contains 27 sections, 6 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparing Digi-Q with on-policy policy-gradient methods. $(s, a)$ rollout pairs that are learned are marked green in the buffer. Typically, policy-based methods utilize a state value function to filter for promising state-action pairs, and require online data to improve. In contrast, Digi-Q learns a state-action (Q) value function through TD-learning on offline data and re-samples multiple actions for each state. This Q-function is then used to rank the re-sampled actions, and the policy is trained on the best action under each state. Digi-Q achieves much higher sample efficiency than policy-based methods, so it can be applied even in a fully offline setting.
  • Figure 2: Overview of Digi-Q. Blue arrows represent forward data flows, while red arrows represent how we obtain the learning targets used for backpropagation. Our method first goes through a representation fine-tuning stage to extract actionable features from the VLM. TD-learning is then performed on top of frozen VLM representations to learn a reliable Q-value function, followed by the Best-of-N policy extraction approach.
  • Figure 3: Left: Performance of Digi-Q when varying the number of actions $N$ used for policy extraction. Observe that the performance of Digi-Q improves when more actions are used for policy extraction, indicating the efficacy of our approach and the benefits of learning a Q-function. Right: Data efficiency of Digi-Q and DigiRL. The success rate of Digi-Q increases significantly faster than that of offline DigiRL as the amount of data grows.
  • Figure 4: Qualitative examples showing the advantage estimates for several transitions under TD (ours), Monte-Carlo (MC), and TD without VLM representations. Advantage estimates from TD-learned value functions on top of VLM representations align better with human judgments than those from MC and from TD without the VLM.
  • Figure 5: Offline critic evaluation accuracy as a function of compute, measured in training FLOPs, compared across Digi-Q, end-to-end TD-learning on a VLM, and MC return. Observe that the critic accuracy is much better for our approach than for end-to-end TD-learning as the amount of compute increases.
  • ...and 5 more figures
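
To make the TD-learning stage shown in Figures 1 and 2 more concrete, here is a minimal sketch of a TD(0) update for a linear Q-head on top of fixed feature vectors. It assumes a SARSA-style target that bootstraps from the logged next action and treats the target as a constant during the update; the paper's actual loss, network head, and target construction may differ. The feature vectors stand in for frozen VLM representations, and every name (`td_target`, `td_epoch`, `chain`) is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
# (features of (s, a), reward, features of (s', a'), episode ended?)
Transition = Tuple[Vector, float, Vector, bool]


def dot(w: Vector, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def td_target(w: Vector, reward: float, next_feat: Vector, done: bool, gamma: float) -> float:
    """Bootstrapped target r + gamma * Q(s', a'); no bootstrap at episode end."""
    return reward if done else reward + gamma * dot(w, next_feat)


def td_epoch(w: Vector, batch: List[Transition], lr: float, gamma: float) -> Vector:
    """One pass of semi-gradient TD(0) over offline transitions.

    Only the linear head w is updated; the features are held fixed,
    mirroring the idea of learning a Q-function on frozen representations.
    """
    new_w = list(w)
    for feat, reward, next_feat, done in batch:
        target = td_target(new_w, reward, next_feat, done, gamma)  # held constant
        error = target - dot(new_w, feat)
        for i, x in enumerate(feat):
            new_w[i] += lr * error * x
    return new_w


# Toy two-step episode with one-hot features: reward 1 arrives at the last step.
chain: List[Transition] = [
    ([1.0, 0.0], 0.0, [0.0, 1.0], False),
    ([0.0, 1.0], 1.0, [0.0, 0.0], True),
]

w: Vector = [0.0, 0.0]
for _ in range(200):
    w = td_epoch(w, chain, lr=0.5, gamma=0.5)
# w converges toward [0.5, 1.0]: the final step is worth 1, the first gamma * 1.
```

Because the update only touches the small head `w`, the cost per step is independent of the size of the model that produced the features, which is the compute argument the abstract makes for training on frozen representations.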