Intention-Conditioned Flow Occupancy Models

Chongyi Zheng, Seohong Park, Sergey Levine, Benjamin Eysenbach

TL;DR

Intention-Conditioned Flow Occupancy Models (InFOM) tackle RL pre-training from reward-free, heterogeneous data by learning a variational intention encoder together with an expressive flow-based occupancy model that forecasts intention-conditioned future states. The framework uses SARSA-style flow matching to model long-horizon dynamics, then applies implicit generalized policy improvement (via expectile Q-distillation) to extract policies for downstream tasks. Across 36 state-based and 4 image-based benchmark tasks, InFOM achieves a 1.8× median improvement in returns and 36% higher success rates compared with alternative pre-training methods, and its latent intentions align with ground-truth behaviors. The work shows that combining intention-conditioned occupancy modeling with flow-based generative modeling enables efficient, scalable pre-training for RL and improves downstream adaptation.
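To make the flow-matching ingredient concrete, here is a minimal sketch of the standard conditional flow matching loss with a straight-line path between noise and data. It is not the SARSA-style temporal-difference variant that InFOM uses, and it omits the intention latent and state-action conditioning entirely. The `velocity_fn(x_t, t)` interface and all names are hypothetical, chosen for illustration only.

```python
import random


def flow_matching_loss(velocity_fn, targets, rng):
    """Monte Carlo estimate of a conditional flow matching loss.

    Assumptions (not from the paper): a linear path x_t = (1 - t) * x0 + t * x1
    from Gaussian noise x0 to a data point x1, and a hypothetical model
    interface velocity_fn(x_t, t) -> list[float].
    """
    total = 0.0
    for x1 in targets:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]        # noise sample
        t = rng.random()                              # time in [0, 1)
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        target_v = [b - a for a, b in zip(x0, x1)]    # velocity of the straight path
        pred_v = velocity_fn(x_t, t)
        total += sum((p - v) ** 2 for p, v in zip(pred_v, target_v))
    return total / len(targets)


# Toy check: when every data point equals `goal`, the field
# (goal - x_t) / (1 - t) reproduces the path velocity exactly.
goal = [2.0, -1.0]
targets = [goal] * 1000


def oracle(x_t, t):
    return [(g - x) / (1.0 - t) for g, x in zip(goal, x_t)]


def zero(x_t, t):
    return [0.0 for _ in x_t]


oracle_loss = flow_matching_loss(oracle, targets, random.Random(0))
zero_loss = flow_matching_loss(zero, targets, random.Random(1))
```

In this toy setting `oracle_loss` is zero up to floating-point error, while `zero_loss` estimates the expected squared distance between the noise and `goal`. A learned model would replace `oracle` with a network trained to minimize this loss.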

Abstract

Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and can then be adapted and fine-tuned to specific tasks by anyone in the community, including those without the data or compute resources to train a model from scratch. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, a fundamental challenge remains in pre-training large models for RL: actions have long-term dependencies, so it is important to train a foundation model that reasons across time. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we use flow matching to build a probabilistic model that predicts which states an agent will visit in the temporally distant future (i.e., an occupancy measure). As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). In comparisons with alternative pre-training methods on $36$ state-based and $4$ image-based benchmark tasks, the proposed method achieves a $1.8 \times$ median improvement in returns and increases success rates by $36\%$. Website: https://chongyi-zheng.github.io/infom Code: https://github.com/chongyi-zheng/infom
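The occupancy measure mentioned above can be illustrated with a short sampling sketch. It assumes the common discounted definition, in which the state reached `h + 1` steps after the current one is weighted by `(1 - gamma) * gamma**h`. The discount `gamma`, the function name, and the truncation at the end of a finite trajectory are assumptions made for illustration and are not taken from this summary.

```python
import random


def sample_future_index(t, traj_len, gamma, rng):
    """Sample the index of a future state under a discounted occupancy measure.

    Assumption: the offset h is geometric, P(h) = (1 - gamma) * gamma**h, and
    the future state is s_{t + 1 + h}. For a finite trajectory the index is
    clipped to the last state, which places the remaining tail mass there.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if not 0 <= t < traj_len - 1:
        raise ValueError("t must leave at least one future state")
    offset = 0
    while rng.random() < gamma:   # continue with probability gamma
        offset += 1
    return min(t + 1 + offset, traj_len - 1)


# Example: draw future-state indices for the first state of a long trajectory.
rng = random.Random(0)
indices = [sample_future_index(0, 10**6, 0.9, rng) for _ in range(20000)]
mean_offset = sum(i - 1 for i in indices) / len(indices)
```

With `gamma = 0.9` the expected offset is `gamma / (1 - gamma) = 9`, so `mean_offset` should land close to 9. Pairs of a current state and a future state sampled this way are the kind of data a generative occupancy model is fit to.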

Paper Structure

This paper contains 49 sections, 20 equations, 18 figures, 4 tables, 3 algorithms.

Figures (18)

  • Figure 1: InFOM is a latent variable model for pre-training and fine-tuning in reinforcement learning. (Left) The datasets are collected by users performing distinct tasks. (Center) We encode intentions by maximizing an evidence lower bound of data likelihood, (Right) enabling intention-aware future prediction using flow matching. See Sec. \ref{sec:method} for details.
  • Figure 2: Domains for evaluation. (Left) ExORL domains (16 state-based tasks). (Right) OGBench domains (20 state-based tasks and 4 image-based tasks).
  • Figure 3: Evaluation on ExORL and OGBench tasks. We compare InFOM against prior methods that utilize various learning paradigms on task-agnostic pre-training and task-specific fine-tuning. InFOM performs similarly to, if not better than, prior methods on 7 out of the 9 domains, including the most challenging visual tasks. We report means and standard deviations over 8 random seeds (4 random seeds for image-based tasks) with error bars indicating one standard deviation. See Table \ref{tab:offline2offline-eval} for full results.
  • Figure 4: Visualization of latent intentions. (Top) The optimal policy picks up the blue block from the left and places it on the right. (Bottom) Using t-SNE \citep{maaten2008visualizing}, we visualize the latent intentions inferred by the variational intention encoder in InFOM, comparing against latent representations inferred by HILP and FB for learning FOMs. The predictions from InFOM align with the underlying intentions. See Sec. \ref{subsec:visualizing-latent-intentions} for details and Appendix \ref{appendix:visualizing-latent-intentions} for more visualizations.
  • Figure 5: Comparison to alternative policy extraction strategies. We compare InFOM to alternative policy extraction strategies based on the standard generalized policy improvement or one-step policy improvement. Our method is $44\%$ more performant with $8 \times$ smaller variance than the variant using the standard GPI. See Sec. \ref{subsec:policy-extraction-ablation} for details.
  • ...and 13 more figures