Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning

Qingyuan Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao

TL;DR

The paper tackles the learning inefficiency of vision-language model (VLM) agents in complex, sparse-reward decision tasks. It introduces Variational Subgoal-Conditioned RL (VSC-RL), which reframes decision making as a variational subgoal-conditioned RL problem and derives the SubGoal-Conditioned ELBO ($\text{SGC-ELBO}$), an objective that combines subgoal-conditioned return maximization with divergence minimization with respect to a reference policy. A key contribution is autonomous subgoal generation by a VLM, integrated with an efficient optimization scheme (advantage-weighted regression, AWR, plus an imitation loss) that maximizes $\text{SGC-ELBO}$. The approach improves both learning efficiency and final performance on mobile device and web control benchmarks (e.g., AitW General, Web Shopping, WebArena-Lite), and comes with theoretical guarantees and ablations that isolate the value of subgoals and of each component. Overall, VSC-RL provides a scalable framework for using VLMs to guide long-horizon decision making in real-world multimodal environments, with potential for broader embodied and robotic applications.
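The optimization scheme mentioned above (AWR plus an imitation loss) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the batch layout, the weight clipping, and the `imitation_coef` and `temperature` parameters are all hypothetical choices made for illustration. It only shows the general shape of an advantage-weighted log-likelihood term combined with a log-likelihood term on a reference policy's actions.

```python
import math


def awr_weight(advantage, temperature=1.0, max_weight=20.0):
    """Exponentiated-advantage weight, clipped at max_weight (clipping is an assumption)."""
    exponent = min(advantage / temperature, math.log(max_weight))
    return math.exp(exponent)


def awr_plus_imitation_loss(batch, imitation_coef=1.0, temperature=1.0):
    """Mean over the batch of: -w(A) * log pi(a | s, sg) - c * log pi(a_ref | s, sg).

    Each item is a dict with hypothetical keys:
      'advantage'       : subgoal-conditioned advantage estimate for the taken action
      'logp_action'     : log-probability of the taken action under the target policy
      'logp_ref_action' : log-probability the target policy assigns to the
                          reference policy's action
    """
    total = 0.0
    for item in batch:
        weight = awr_weight(item["advantage"], temperature)
        awr_term = -weight * item["logp_action"]      # return-maximization side
        imitation_term = -item["logp_ref_action"]     # stay close to the reference
        total += awr_term + imitation_coef * imitation_term
    return total / len(batch)


if __name__ == "__main__":
    toy_batch = [
        {"advantage": 0.5, "logp_action": math.log(0.6), "logp_ref_action": math.log(0.3)},
        {"advantage": -0.2, "logp_action": math.log(0.1), "logp_ref_action": math.log(0.7)},
    ]
    print(awr_plus_imitation_loss(toy_batch))
```

In this sketch, actions with a higher advantage receive a larger weight in the first term, while the second term acts as a simple stand-in for the divergence-minimization side of the objective.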

Abstract

State-of-the-art (SOTA) reinforcement learning (RL) methods have enabled vision-language model (VLM) agents to learn from interaction with online environments without human supervision. However, these methods often learn inefficiently when applied to complex, real-world decision-making tasks with sparse rewards and long-horizon dependencies. We propose a novel framework, Variational Subgoal-Conditioned Reinforcement Learning (VSC-RL), which advances VLM agents in solving challenging decision-making tasks. Fundamentally distinct from existing methods, VSC-RL reformulates the decision-making problem as a variational subgoal-conditioned RL problem with a newly derived optimization objective, the SubGoal-Conditioned Evidence Lower BOund (SGC-ELBO), which comprises two key components: (a) maximizing the subgoal-conditioned return, and (b) minimizing the divergence from a reference goal-conditioned policy. We demonstrate theoretically and empirically that VSC-RL improves learning efficiency without compromising performance guarantees. Across a diverse set of challenging benchmarks, including mobile device and web control tasks, VSC-RL consistently outperforms existing SOTA methods in both learning efficiency and final performance.
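The two components of the objective described above can be made concrete with a toy calculation over discrete action distributions. The sketch below is illustrative only and does not reproduce the paper's SGC-ELBO: the additive form over subgoals, the KL direction, the trade-off coefficient `beta`, and all function names are assumptions introduced here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_objective(subgoal_returns, policy_probs, reference_probs, beta=1.0):
    """Sum over subgoals of: expected return - beta * KL(policy || reference).

    subgoal_returns : one expected return per subgoal (component a)
    policy_probs    : one action distribution per subgoal for the target policy
    reference_probs : matching action distributions of the reference policy
    beta            : hypothetical weight on the divergence term (component b)
    """
    total = 0.0
    for ret, p, q in zip(subgoal_returns, policy_probs, reference_probs):
        total += ret - beta * kl_divergence(p, q)
    return total


if __name__ == "__main__":
    returns = [1.0, 0.5]
    policy = [[0.7, 0.3], [0.2, 0.8]]
    reference = [[0.5, 0.5], [0.4, 0.6]]
    print(toy_objective(returns, policy, reference, beta=0.1))
```

When the target policy matches the reference on every subgoal, the divergence term vanishes and the toy objective reduces to the sum of subgoal returns; any deviation from the reference is penalized in proportion to `beta`.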

Paper Structure

This paper contains 41 sections, 4 theorems, 20 equations, 15 figures, 7 tables, 1 algorithm.

Key Result

Proposition 4.1

Given a goal $g$ with corresponding subgoals $\{sg_i\}_{i=1}^N$ and a subgoal-conditioned target policy $\pi$, the objective of is equivalent to the objective of

Figures (15)

  • Figure 1: The pipeline of VSC-RL. (a) VLM autonomously decomposes the goal $g$ to the subgoals $\{sg_i\}_{i=1}^N$. VSC-RL optimizes the objective of $\text{SGC-ELBO}$ consisting of (b) maximizing the subgoal-conditioned return and (c) minimizing the subgoal-conditioned difference.
  • Figure 2: Autonomous subgoal generation in an AitW task. The VLM autonomously decomposes the goal of a complicated mobile device control task into easily achievable subgoals.
  • Figure 3: Learning curves on AitW (a) General and (b) Web Shopping tasks.
  • Figure 4: Learning curves on WebArena-Lite.
  • Figure 5: Learning curves on MultiRoom tasks of (a) 2 rooms, (b) 4 rooms, and (c) 6 rooms.
  • ...and 10 more figures

Theorems & Definitions (6)

  • Proposition 4.1: Subgoal-Conditioned Optimization Objective (proof in the appendix)
  • Proposition 4.2: Subgoal-Conditioned Difference Bound (proof in the appendix)
  • Proposition B.1: Subgoal-Conditioned Optimization Objective
  • Proof of Proposition B.1
  • Proposition B.2: Subgoal-Conditioned Difference Bound
  • Proof of Proposition B.2