Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning

Yeda Song; Dongwook Lee; Gunhee Kim

TL;DR

The paper tackles distributional shift in offline reinforcement learning by introducing COmpositional COnservatism with Anchor-seeking (COCOA), a framework that enforces conservatism in a compositional input space via bilinear transduction. COCOA decomposes each state into an in-distribution anchor and a delta (the difference between the state and its anchor). A learned reverse dynamics model is used to train an anchor-seeking policy that steers trajectories toward known data regions. Applying bilinear transduction to the policy and value networks together with this anchor-seeking module generally improves four offline RL baselines (CQL, IQL, MOPO, and MOBILE) on the D4RL benchmark, and an ablation study confirms the importance of the anchor-seeking component. The approach is a flexible add-on for improving generalization in offline RL: it places conservatism in the input decomposition, independently of the behavioral conservatism that existing methods apply to the policy or value function.

Abstract

Offline reinforcement learning (RL) is a compelling framework for learning optimal policies from past experiences without additional interaction with the environment. Nevertheless, offline RL inevitably faces the problem of distributional shifts, where the states and actions encountered during policy execution may not be in the training dataset distribution. A common solution involves incorporating conservatism into the policy or the value function to safeguard against uncertainties and unknowns. In this work, we focus on achieving the same objectives of conservatism but from a different perspective. We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline RL, an approach that pursues conservatism in a compositional manner on top of the transductive reparameterization (Netanyahu et al., 2023), which decomposes the input variable (the state in our case) into an anchor and its difference from the original input. Our COCOA seeks both in-distribution anchors and differences by utilizing the learned reverse dynamics model, encouraging conservatism in the compositional input space for the policy or value function. Such compositional conservatism is independent of and agnostic to the prevalent behavioral conservatism in offline RL. We apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark, where COCOA generally improves the performance of each algorithm. The code is available at https://github.com/runamu/compositional-conservatism.
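The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes the standard bilinear form, in which the prediction is an inner product between an embedding of the delta and an embedding of the anchor. The linear embeddings with random weights, the dimensions, and all function names below are hypothetical choices made for illustration; they are not taken from the paper or its repository, where the embeddings would be learned networks.

```python
import random

def make_weights(rows, cols, rng):
    """Random weight matrix (hypothetical stand-in for a learned embedding)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def linear_embed(weights, x):
    """Apply a linear map to vector x and return the result as a list."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def decompose(state, anchor):
    """Delta is the element-wise difference between the state and its anchor."""
    return [s - a for s, a in zip(state, anchor)]

def bilinear_predict(w_delta, w_anchor, state, anchor):
    """Predict a scalar as <phi_1(delta), phi_2(anchor)>."""
    delta = decompose(state, anchor)
    phi_delta = linear_embed(w_delta, delta)
    phi_anchor = linear_embed(w_anchor, anchor)
    return sum(p * q for p, q in zip(phi_delta, phi_anchor))

rng = random.Random(0)
STATE_DIM, EMBED_DIM = 3, 4  # illustrative sizes only
w_delta = make_weights(EMBED_DIM, STATE_DIM, rng)
w_anchor = make_weights(EMBED_DIM, STATE_DIM, rng)

state = [1.0, 2.0, 3.0]
anchor = [0.5, 2.0, 2.0]   # assumed to be an in-distribution state
delta = decompose(state, anchor)
value = bilinear_predict(w_delta, w_anchor, state, anchor)
```

Because the predictor only ever sees the pair (delta, anchor), an unfamiliar state can still be handled as long as both components resemble ones seen during training. The bias-free linear embeddings used here make the output exactly zero when the state equals its anchor; that is a property of this toy, not of the method.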

Paper Structure (27 sections, 1 theorem, 10 equations, 6 figures, 6 tables, 3 algorithms)

Introduction
Preliminaries
Offline RL
Bilinear Transduction
Compositional Conservatism with Anchor-seeking (COCOA)
Offline RL with Bilinear Transduction
Learning to Seek In-Distribution Decomposition
Anchor-Seeking Trajectory
Training of Dynamics-Aware Anchor-Seeking Policy
Summary of The Method
Experiments
Results of D4RL Benchmark Tasks
Ablation Study: The Effect of Anchor-Seeking
Related Work
Conclusion
...and 12 more sections

Key Result

Theorem 1

Suppose $X$ is a locally compact Hausdorff space and $A$ is a subalgebra of $C_0(X, \mathbb{R})$. Then $A$ is dense in $C_0(X, \mathbb{R})$ with respect to the topology of uniform convergence if and only if it separates points and vanishes nowhere.
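A short worked note on how a density result of this kind can support a bilinear predictor. The product space $X = \mathcal{D} \times \tilde{\mathcal{S}}$ and the functions $g_k$, $h_k$ are notation introduced here for illustration and are assumptions, not the paper's own statement or proof.

```latex
% Assumed setting: deltas \Delta s \in \mathcal{D}, anchors \tilde{s} \in \tilde{\mathcal{S}},
% both locally compact Hausdorff, and X = \mathcal{D} \times \tilde{\mathcal{S}}.
A = \Big\{ (\Delta s, \tilde{s}) \mapsto \sum_{k=1}^{K} g_k(\Delta s)\, h_k(\tilde{s})
    \;:\; K \in \mathbb{N},\ g_k \in C_0(\mathcal{D}, \mathbb{R}),\
    h_k \in C_0(\tilde{\mathcal{S}}, \mathbb{R}) \Big\}

% A is a subalgebra of C_0(X, \mathbb{R}), since
(g_1 h_1)(g_2 h_2) = (g_1 g_2)(h_1 h_2) \in A .

% A separates points and vanishes nowhere (choose g and h by Urysohn's lemma),
% so Theorem 1 gives: for every f \in C_0(X, \mathbb{R}) and \varepsilon > 0
% there exist K and g_k, h_k such that
\sup_{(\Delta s, \tilde{s}) \in X}
  \Big| f(\Delta s, \tilde{s}) - \sum_{k=1}^{K} g_k(\Delta s)\, h_k(\tilde{s}) \Big| < \varepsilon .
```

The finite sum is an inner product $\langle \phi_1(\Delta s), \phi_2(\tilde{s}) \rangle$ of two $K$-dimensional embeddings, which is the form a bilinear transductive predictor takes.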

Figures (6)

Figure 1: (a) An illustration of anchor-seeking rollouts that find anchors close to the seen area of the state space $\mathcal{S}$. Given the current state $s$, the anchor-seeking policy $\tilde{\pi}$ gives actions to reach the anchor $\tilde{s}$. Its behavior is derived by utilizing reverse model rollouts, which diverge from the offline dataset. (b) An illustration of the current state $s$ and an anchor $\tilde{s}$. Ideally, the anchor $\tilde{s}$ has been observed during the training phase when it served as an anchor for another state. Similarly, the difference, delta, had also been encountered previously in the ideal case but in combination with a different anchor. (c) The architecture of our policy $\pi(a|s)$ that aims to generalize to an unfamiliar state by decomposing the state $s$ into familiar components (seen anchor $\tilde{s}$ and seen delta $\mathit{\Delta} s$) and applying a transductive predictor. The architecture of the Q-function is similar to that of the policy.
Figure 2: Performance of CQL, IQL, MOPO, and MOBILE with and without COCOA on the halfcheetah-medium-expert-v2 task of D4RL.
Figure 3: Performance comparison of CQL, CQL+COCOA and CQL+COCOA without anchor-seeking across all D4RL tasks except for "random" tasks.
Figure 4: Performance comparison of IQL and IQL+COCOA across all D4RL tasks except for "random" tasks.
Figure 5: Performance comparison of MOPO and MOPO+COCOA across all D4RL tasks except for "random" tasks.
...and 1 more figure
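The anchor-seeking idea in the Figure 1 caption can be sketched on a toy problem. The sketch assumes a one-dimensional integer state space, known dynamics $s' = s + a$, and a lookup table in place of a learned policy; every name and constant is hypothetical. In the paper both the reverse dynamics model and the anchor-seeking policy are learned, so this shows only the direction of the data flow: reverse rollouts leave the dataset, and replaying them forward yields actions that lead back to it.

```python
DATASET = {0, 1, 2, 3, 4}   # toy "seen" states on a 1-D line (assumption)
ACTIONS = (-1, 1)
HORIZON = 3

def forward(state, action):
    """Toy forward dynamics: s' = s + a."""
    return state + action

def reverse(next_state, action):
    """Toy reverse dynamics: the state that reaches next_state via action."""
    return next_state - action

def collect_reverse_rollouts(dataset, actions, horizon):
    """Roll the reverse model away from dataset states.

    Each recorded pair (s, a) means: taking a in s moves one step back
    toward the dataset. Rollouts that stay inside the dataset are dropped.
    """
    pairs = {}
    for start in sorted(dataset):
        for action in actions:
            state = start
            for _ in range(horizon):
                prev_state = reverse(state, action)
                if prev_state in dataset:
                    break
                pairs[prev_state] = action
                state = prev_state
    return pairs

def seek_anchor(state, policy, dataset, horizon):
    """Follow the table policy forward for at most `horizon` steps."""
    for _ in range(horizon):
        if state in dataset or state not in policy:
            break
        state = forward(state, policy[state])
    return state

# The table stands in for a policy trained on the reverse-rollout pairs.
policy = collect_reverse_rollouts(DATASET, ACTIONS, HORIZON)

query = 6                                   # a state outside the dataset
anchor = seek_anchor(query, policy, DATASET, HORIZON)
delta = query - anchor
```

Here the query state 6 is walked back to the nearest seen state, which becomes its anchor, and the remaining offset is the delta passed to the transductive predictor.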

Theorems & Definitions (1)

Theorem 1: Stone-Weierstrass Theorem for Locally Compact Spaces
