In-Context Compositional Q-Learning for Offline Reinforcement Learning

Qiushui Xu, Yuhao Huang, Yushu Jiang, Lei Song, Jinyu Wang, Wenliang Zheng, Jiang Bian

Abstract

Accurate estimation of the Q-function is a central challenge in offline reinforcement learning. However, existing approaches often rely on a shared global Q-function, which is inadequate for capturing the compositional structure of tasks that consist of diverse subtasks. We propose In-context Compositional Q-Learning (ICQL), an offline RL framework that formulates Q-learning as a contextual inference problem and uses linear Transformers to adaptively infer local Q-functions from retrieved transitions without explicit subtask labels. Theoretically, we show that, under two assumptions -- linear approximability of the local Q-function and accurate inference of weights from retrieved context -- ICQL achieves a bounded approximation error for the Q-function and enables near-optimal policy extraction. Empirically, ICQL substantially improves performance in offline settings, achieving gains of up to 16.4% on kitchen tasks and up to 8.8% and 6.3% on MuJoCo and Adroit tasks, respectively. These results highlight the underexplored potential of in-context learning for robust and compositional value estimation and establish ICQL as a principled and effective framework for offline RL.

Paper Structure

This paper contains 41 sections, 7 theorems, 51 equations, 10 figures, 15 tables.

Key Result

Theorem 3.5

Suppose the feature-approximation and set-coverage assumptions hold, and the learned policy $\pi$ is greedy with respect to $\hat{Q}(s, a|\Omega^{d_k}_s)$. Then, with probability at least $1-\delta$, the performance gap of $\pi$ is bounded, where the constant $C>0$ appearing in the bound depends on $B_\phi$, $B_r$, and the conditioning of the local Gram matrix.
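To make "the conditioning of the local Gram matrix" concrete, here is a minimal sketch that is not taken from the paper. It assumes the local Gram matrix is the normalized feature Gram matrix $\frac{1}{k}\Phi^\top\Phi$ of the $k$ retrieved context features. The function name `local_gram_condition` and the toy data are hypothetical. It only illustrates that a context of nearly collinear features is far worse conditioned than a diverse one; it does not reproduce the theorem's bound.

```python
import numpy as np

def local_gram_condition(context_feats):
    """Condition number of the (assumed) local Gram matrix Phi^T Phi / k."""
    gram = context_feats.T @ context_feats / len(context_feats)
    return np.linalg.cond(gram)

rng = np.random.default_rng(0)

# A diverse context: features spread in all four directions.
diverse = rng.normal(size=(32, 4))

# A nearly collinear context: features are almost multiples of one direction.
direction = np.ones(4)
collinear = rng.normal(size=(32, 1)) * direction + 1e-3 * rng.normal(size=(32, 4))

cond_diverse = local_gram_condition(diverse)
cond_collinear = local_gram_condition(collinear)
print(f"diverse: {cond_diverse:.1f}, collinear: {cond_collinear:.1e}")
```

Under this reading, a retrieved context whose features span few directions would give a poorly conditioned Gram matrix, which matches the theorem's statement that $C$ depends on that conditioning.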

Figures (10)

  • Figure 1: Center: dimension-reduced states and SAC value estimates on Walker2d-Medium-Expert. Left and right: two groups of similar states.
  • Figure 2: An overview of In-Context Compositional Q-Learning (ICQL). Given a query state-action pair $(s_{\rm query},a_{\rm query})$, the model embeds it with the feature extractor $\phi$, retrieves the top-$k$ most similar transitions from the offline dataset $\mathcal{D}$, and forms a local context set. A local linear Q-function $\hat{Q}(s, a|\Omega_{s_{\rm query}}^{d_k})$ is then estimated from the retrieved context and used to update the actor.
  • Figure 3: Q-value distribution over states of the Walker2d-Medium dataset after t-SNE dimensionality reduction. The partitioned value patterns support our hypothesis that Q-functions are inherently compositional, motivating localized value modeling.
  • Figure 4: Normalized scores for different numbers of in-context learning layers on MuJoCo tasks. Each color represents a different number of layers, and the y-axis shows the normalized score.
  • Figure 5: Normalized scores for different context lengths on MuJoCo tasks. Each color represents a different context length, and the y-axis shows the normalized score.
  • ...and 5 more figures
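
The pipeline described in the Figure 2 caption (embed the query, retrieve the top-$k$ most similar transitions, estimate a local linear Q-function) can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: ICQL infers the local weights with a linear Transformer, whereas this sketch substitutes ridge regression on the retrieved context. The function names, the Euclidean similarity measure, the precomputed features, and the toy regression targets are all assumptions made for the example.

```python
import numpy as np

def retrieve_top_k(query_feat, data_feats, k):
    """Indices of the k dataset features closest to the query (Euclidean distance)."""
    dists = np.linalg.norm(data_feats - query_feat, axis=1)
    return np.argsort(dists)[:k]

def fit_local_q(context_feats, context_targets, ridge=1e-6):
    """Ridge-regression weights w such that phi(s, a) @ w fits the context targets.

    Stand-in for the linear-Transformer weight inference used in the paper.
    """
    d = context_feats.shape[1]
    gram = context_feats.T @ context_feats + ridge * np.eye(d)
    return np.linalg.solve(gram, context_feats.T @ context_targets)

def local_q_value(query_feat, data_feats, data_targets, k=16):
    """Estimate Q at the query from a local linear fit on its retrieved context."""
    idx = retrieve_top_k(query_feat, data_feats, k)
    w = fit_local_q(data_feats[idx], data_targets[idx])
    return float(query_feat @ w)

# Toy dataset with two regions, each governed by a different linear Q-function.
rng = np.random.default_rng(0)
feats_a = rng.normal(loc=-3.0, size=(200, 4))
feats_b = rng.normal(loc=3.0, size=(200, 4))
w_a = np.array([1.0, 0.0, -1.0, 0.5])
w_b = np.array([-2.0, 1.0, 0.0, 0.0])
data_feats = np.vstack([feats_a, feats_b])
data_targets = np.concatenate([feats_a @ w_a, feats_b @ w_b])

query = np.full(4, 3.0)            # lies inside region b
neighbors = retrieve_top_k(query, data_feats, 16)
q_local = local_q_value(query, data_feats, data_targets)
q_true = float(query @ w_b)
print(q_local, q_true)
```

In this toy setting a single global linear fit would have to compromise between the two regions, while the local fit only sees neighbors from the query's own region. That is the intuition behind localized value modeling, shown here under the simplifying assumptions listed above.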

Theorems & Definitions (20)

  • Definition 3.1
  • Definition 3.2: State-Similar Retrieval
  • Definition 3.3: Context-dependent Weights
  • Remark 3.2
  • Remark 3.4
  • Theorem 3.5: Policy Performance Gap
  • proof
  • Definition C.1: Random Retrieval
  • Definition C.2: State-Similar-with-High-Rewards Retrieval
  • Lemma D.1
  • ...and 10 more