Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning
Qingyuan Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
TL;DR
The paper tackles the learning inefficiency of vision-language model (VLM) agents in complex decision-making tasks with sparse rewards and long horizons. It introduces Variational Subgoal-Conditioned RL (VSC-RL), which reformulates decision making as a variational subgoal-conditioned RL problem and derives the SubGoal-Conditioned Evidence Lower BOund ($\text{SGC-ELBO}$), an objective that combines maximizing the subgoal-conditioned return with minimizing the divergence from a reference policy. A key contribution is autonomous subgoal generation by a VLM, paired with an efficient optimization scheme (advantage-weighted regression, AWR, plus an imitation loss) for maximizing $\text{SGC-ELBO}$. The approach improves learning efficiency and final performance on mobile device and web control benchmarks (e.g., AitW General, Web Shopping, WebArena-Lite), and is supported by theoretical guarantees and by ablations that isolate the contribution of subgoals and of each component. Overall, VSC-RL offers a scalable framework for using VLMs to guide long-horizon decision making in real-world multimodal environments, with potential extensions to embodied and robotic applications.
Abstract
State-of-the-art (SOTA) reinforcement learning (RL) methods have enabled vision-language model (VLM) agents to learn from interaction with online environments without human supervision. However, these methods often learn inefficiently when applied to complex, real-world decision-making tasks with sparse rewards and long-horizon dependencies. We propose a novel framework, Variational Subgoal-Conditioned Reinforcement Learning (VSC-RL), which advances VLM agents in solving challenging decision-making tasks. Fundamentally distinct from existing methods, VSC-RL reformulates the decision-making problem as a variational subgoal-conditioned RL problem with a newly derived optimization objective, the SubGoal-Conditioned Evidence Lower BOund (SGC-ELBO), which comprises two key components: (a) maximizing the subgoal-conditioned return, and (b) minimizing the divergence from a reference goal-conditioned policy. We demonstrate theoretically and empirically that VSC-RL improves learning efficiency without compromising performance guarantees. Across a diverse set of challenging benchmarks, including mobile device and web control tasks, VSC-RL consistently outperforms existing SOTA methods in both learning efficiency and final performance.
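To make the two components of the objective concrete, the following is a minimal sketch of how a loss of this shape could be assembled for a discrete action space. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `beta` temperature, the weight clipping, the `imitation_coef` weighting, and the use of cross-entropy on reference-policy actions as a stand-in for the divergence term are all hypothetical choices made here.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def awr_loss(logits, actions, advantages, beta=1.0, max_weight=20.0):
    """Advantage-weighted regression term (assumed form for component (a)).

    Log-likelihoods of the taken actions are weighted by exp(advantage / beta),
    clipped at max_weight for stability (the clipping is an assumption).
    """
    logp = log_softmax(logits)
    chosen = logp[np.arange(len(actions)), actions]
    weights = np.minimum(np.exp(advantages / beta), max_weight)
    return -(weights * chosen).mean()


def imitation_loss(logits, ref_actions):
    """Cross-entropy toward actions from a reference policy.

    Used here as a hypothetical surrogate for component (b), the divergence
    from the reference goal-conditioned policy.
    """
    logp = log_softmax(logits)
    return -logp[np.arange(len(ref_actions)), ref_actions].mean()


def combined_loss(logits, actions, advantages, ref_actions, imitation_coef=1.0):
    """Sum of the two terms; minimizing it loosely mirrors maximizing an ELBO."""
    return (awr_loss(logits, actions, advantages)
            + imitation_coef * imitation_loss(logits, ref_actions))


if __name__ == "__main__":
    # Toy batch: 2 states, 3 discrete actions, uniform policy logits.
    logits = np.zeros((2, 3))
    actions = np.array([0, 1])          # actions taken under subgoals
    advantages = np.array([0.0, 0.0])   # subgoal-conditioned advantages
    ref_actions = np.array([2, 2])      # actions chosen by the reference policy
    print(combined_loss(logits, actions, advantages, ref_actions))
```

In this sketch, `imitation_coef` trades off return maximization against staying close to the reference policy; how the paper balances the two terms is not stated in the abstract.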
