Scalable In-Context Q-Learning

Jinmei Liu, Fuhong Liu, Zhenhong Sun, Jianye Hao, Huaxiong Li, Bo Wang, Daoyi Dong, Chunlin Chen, Zhi Wang

TL;DR

Scalable In-Context Q-Learning (S-ICQL) tackles offline multi-task reinforcement learning by integrating dynamic programming with world modeling in a prompt-based transformer. A pretrained world model constructs compact task prompts, while the transformer jointly outputs policy, state-value, and action-value heads; policy improvement proceeds via upper-expectile Q-learning and advantage-weighted regression, enabling effective stitching of suboptimal trajectories. The approach demonstrates strong, data-efficient performance across discrete and continuous domains, including out-of-distribution (OOD) generalization and high-dimensional tasks, with ablations confirming the contribution of both the world model and the in-context Q-learning components. This combination preserves the stability of supervised pretraining while enabling reward maximization and rapid adaptation to new tasks from limited data.

Abstract

Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend this promise to decision domains. Because decision domains involve more complex dynamics and temporal correlations, existing ICRL approaches may face challenges in learning from suboptimal trajectories and achieving precise in-context inference. In this paper, we propose Scalable In-Context Q-Learning (S-ICQL), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU-RL/SICQL.
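The three training signals named in the abstract (expectile regression for the state value, a Bellman backup for the Q-function, and advantage-weighted regression for the policy) can be illustrated with a minimal NumPy sketch. This follows the generic implicit-Q-learning form of these losses and is not taken from the authors' code; the function names and the values of `tau`, `gamma`, `temperature`, and `max_weight` are assumptions chosen for illustration, and the task prompt is omitted for brevity.

```python
import numpy as np

def expectile_loss(diff, tau=0.8):
    """Asymmetric squared loss: weight tau where diff >= 0, (1 - tau) otherwise."""
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))

def value_loss(q, v, tau=0.8):
    # With tau > 0.5, underestimating Q is penalised more than overestimating it,
    # so V is pulled toward an upper expectile of Q.
    return expectile_loss(q - v, tau)

def q_loss(q, reward, next_v, done, gamma=0.99):
    # One-step Bellman backup that uses V at the next state as the bootstrap target.
    target = reward + gamma * (1.0 - done) * next_v
    return float(np.mean((q - target) ** 2))

def awr_weights(q, v, temperature=3.0, max_weight=100.0):
    # exp(temperature * advantage), clipped for numerical stability (assumed clip value).
    return np.minimum(np.exp(temperature * (q - v)), max_weight)

def policy_loss(log_prob, q, v, temperature=3.0):
    # Weighted negative log-likelihood of the dataset actions.
    return float(-np.mean(awr_weights(q, v, temperature) * log_prob))
```

In this form the policy never queries actions outside the dataset: it only reweights logged actions by their estimated advantage, which is one way a method of this kind can favor the better segments of suboptimal trajectories.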

Paper Structure

This paper contains 26 sections, 7 equations, 11 figures, 13 tables, 2 algorithms.

Figures (11)

  • Figure 1: The overview of S-ICQL. (a) We pretrain a generalized world model to accurately capture task-relevant information from the multi-task offline dataset as in Eq. (\ref{eq:wm_loss}), and use the context encoder to transform a small number of raw transitions into a precise and lightweight prompt $\beta$ as in Eq. (\ref{eq:prompt}). (b) We design a prompt-based multi-head transformer model that simultaneously predicts the optimal policy $\pi_\theta(a|s;\beta)$, the state value function $V_\theta(s;\beta)$, and the Q-function $Q_\theta(s,a;\beta)$ using separate heads, given the task prompt $\beta$ and corresponding query inputs ($s$ or $s,a$). We learn $V_\theta$ by expectile regression as in Eq. (\ref{eq:v_loss}), and use it to compute Bellman backups for training $Q_\theta$ as in Eq. (\ref{eq:q_loss}). The in-context value functions are distilled into policy extraction using advantage-weighted regression as in Eq. (\ref{eq:pi_loss}). (c) Online testing by interacting with the environment. The prompt is initially empty and gradually constructed from historical interactions using the pretrained context encoder.
  • Figure 2: Few-shot evaluation return curves of S-ICQL and baselines on Mixed datasets.
  • Figure 3: Few-shot evaluation return curves of S-ICQL and its ablations on Mixed datasets. w/o_c removes world modeling, w/o_q removes Q-learning, and w/o_cq removes both components.
  • Figure 4: Few-shot evaluation curves of S-ICQL and baselines for OOD tasks on Mixed datasets.
  • Figure 5: Comparison of best dataset returns with DPT, w/o_q, and S-ICQL on training tasks.
  • ...and 6 more figures
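
The multi-head design described in the Figure 1 caption, where one prompt-conditioned model emits $\pi_\theta(a|s;\beta)$, $V_\theta(s;\beta)$, and $Q_\theta(s,a;\beta)$ from separate heads, can be pictured with the toy sketch below. It is only an illustration under stated assumptions: a single `tanh` layer stands in for the transformer, the action space is assumed discrete, all dimensions and parameter names are hypothetical, and representing an empty prompt as a zero vector is an assumption rather than the paper's choice.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_ACTIONS, PROMPT_DIM, HIDDEN = 4, 3, 8, 16  # hypothetical sizes

# Hypothetical parameters: a tiny shared trunk plus three separate output heads.
params = {
    "prompt": rng.normal(scale=0.1, size=(PROMPT_DIM, HIDDEN)),
    "state": rng.normal(scale=0.1, size=(STATE_DIM, HIDDEN)),
    "action": rng.normal(scale=0.1, size=(NUM_ACTIONS, HIDDEN)),
    "pi_head": rng.normal(scale=0.1, size=(HIDDEN, NUM_ACTIONS)),
    "v_head": rng.normal(scale=0.1, size=(HIDDEN, 1)),
    "q_head": rng.normal(scale=0.1, size=(HIDDEN, 1)),
}

def trunk(beta, state, action_onehot=None):
    """Shared features from the prompt and the query (s, or s and a)."""
    h = beta @ params["prompt"] + state @ params["state"]
    if action_onehot is not None:
        h = h + action_onehot @ params["action"]
    return np.tanh(h)

def forward(beta, state, action_onehot):
    h_s = trunk(beta, state)                   # query is s: policy and V heads
    h_sa = trunk(beta, state, action_onehot)   # query is (s, a): Q head
    logits = h_s @ params["pi_head"]
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    exp_logits = np.exp(logits)
    probs = exp_logits / exp_logits.sum(axis=-1, keepdims=True)
    value = (h_s @ params["v_head"]).squeeze(-1)
    q_value = (h_sa @ params["q_head"]).squeeze(-1)
    return probs, value, q_value

beta = np.zeros((2, PROMPT_DIM))        # assumed encoding of an empty prompt
state = rng.normal(size=(2, STATE_DIM))
action = np.eye(NUM_ACTIONS)[[0, 2]]    # one-hot actions for the Q query
probs, value, q_value = forward(beta, state, action)
```

Sharing the trunk means the prompt is processed once per query type while each head stays cheap, which is one plausible reason to prefer separate heads over three separate networks.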