Integrating Domain Knowledge for handling Limited Data in Offline RL

Briti Gangopadhyay; Zhao Wang; Jia-Fong Yeh; Shingo Takamatsu

TL;DR

This work tackles offline reinforcement learning under limited data by introducing ExID, a domain-knowledge-based regularization framework. A teacher policy derived from hierarchical domain knowledge guides the offline critic through a Q-regularizer $\mathcal{L}_r(\theta)$ and a balanced loss $\mathcal{L}(\theta)=\mathcal{L}_{\mathrm{cql}}(\theta)+\lambda\mathbb{E}[Q_s^\theta(s,a_s)-Q_s^\theta(s,a_t)]^2$, combined with a two-phase warm start and uncertainty-based teacher updates. Empirically, ExID yields at least a 27% average improvement over strong offline RL baselines on discrete OpenAI Gym and MiniGrid datasets when data is scarce. It also generalizes to out-of-distribution (OOD) states covered by the domain knowledge $\mathcal{D}$, with $\mathcal{L}_r(\theta)$ becoming more informative as state coverage increases. The approach highlights the practical value of integrating domain knowledge into offline RL to improve robustness, while acknowledging its dependence on the quality and coverage of the domain knowledge and pointing to future work on continuous action spaces and automatic domain knowledge extraction.
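
The balanced loss above can be illustrated with a minimal sketch. It assumes a discrete action space, reads the expectation as a batch mean of squared Q-value gaps, and takes the student action $a_s$ to be the greedy action of the critic; the function names `q_regularizer` and `balanced_loss`, the default `lam`, and the precomputed `cql_loss` argument are hypothetical and not taken from the paper.

```python
from typing import Sequence


def q_regularizer(q_values: Sequence[Sequence[float]],
                  teacher_actions: Sequence[int]) -> float:
    """Batch mean of (Q(s, a_s) - Q(s, a_t))^2.

    Assumption: a_s is the critic's greedy action, so Q(s, a_s) = max_a Q(s, a).
    a_t is the action suggested by the teacher policy for the same state.
    """
    gaps = []
    for q_row, a_t in zip(q_values, teacher_actions):
        q_student = max(q_row)      # Q(s, a_s)
        q_teacher = q_row[a_t]      # Q(s, a_t)
        gaps.append((q_student - q_teacher) ** 2)
    return sum(gaps) / len(gaps)


def balanced_loss(cql_loss: float,
                  q_values: Sequence[Sequence[float]],
                  teacher_actions: Sequence[int],
                  lam: float = 0.5) -> float:
    """L(theta) = L_cql(theta) + lambda * L_r(theta), with L_cql supplied by the caller."""
    return cql_loss + lam * q_regularizer(q_values, teacher_actions)


# Two states, two actions; the teacher prefers action 0 in both states.
q = [[1.0, 3.0], [2.0, 0.0]]
print(q_regularizer(q, [0, 0]))        # 2.0
print(balanced_loss(1.0, q, [0, 0]))   # 2.0
```

In this toy batch the critic disagrees with the teacher only in the first state, so only that state contributes to the penalty. When the greedy action already matches the teacher's action, the regularizer is zero and the loss reduces to the CQL term.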

Abstract

With the ability to learn from static datasets, Offline Reinforcement Learning (RL) emerges as a compelling avenue for real-world applications. However, state-of-the-art offline RL algorithms perform sub-optimally when confronted with limited data confined to specific regions within the state space. The performance degradation is attributed to the inability of offline RL algorithms to learn appropriate actions for rare or unseen observations. This paper proposes a novel domain knowledge-based regularization technique and adaptively refines the initial domain knowledge to considerably boost performance in limited data with partially omitted states. The key insight is that the regularization term mitigates erroneous actions for sparse samples and unobserved states covered by domain knowledge. Empirical evaluations on standard discrete environment datasets demonstrate a substantial average performance increase of at least 27% compared to existing offline RL algorithms operating on limited data.

Paper Structure (17 sections, 2 theorems, 8 equations, 17 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Problem Setting and Methodology
Empirical Evaluations
Experimental Setting
Performance across Different Environments
Generalization to OOD states and contribution of $\mathcal{L}_r(\theta)$
Performance on varying $\lambda$, $k$, and ablation of $\pi_t^\omega$
Effect of varying $\mathcal{D}$ quality
Conclusion
Missing Examples and Proofs
Environments and Domain Knowledge Trees
Related Work: Knowledge Distillation
Network Architecture and Hyper-parameters
...and 2 more sections

Key Result

Proposition 4.2

Algorithm 1 reduces the generalization error if $Q^*(s,\pi_t^\omega(s)) > Q^*(s,\pi(s))$ for $s \in \mathcal{D} \cap \mathcal{B}_r$, where $\pi$ is the vanilla offline RL policy learned on $\mathcal{B}_r$.
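
The condition in the proposition can be made concrete on a toy tabular problem. The sketch below assumes that $Q^*$ is available as a table, which is generally not the case in practice, and that policies are given as lists mapping state indices to actions; the helper name `teacher_dominates` and all inputs are hypothetical illustrations rather than part of the paper.

```python
from typing import Sequence


def teacher_dominates(q_star: Sequence[Sequence[float]],
                      teacher_policy: Sequence[int],
                      offline_policy: Sequence[int],
                      states: Sequence[int]) -> bool:
    """Check Q*(s, pi_t(s)) > Q*(s, pi(s)) for every state s in `states`.

    `states` stands in for the states covered by both the domain
    knowledge D and the reduced buffer B_r.
    """
    return all(
        q_star[s][teacher_policy[s]] > q_star[s][offline_policy[s]]
        for s in states
    )


# Toy example with three states and two actions.
q_star = [[0.0, 1.0], [2.0, 1.0], [5.0, 5.0]]
teacher = [1, 0, 0]
offline = [0, 1, 1]
print(teacher_dominates(q_star, teacher, offline, [0, 1]))     # True
print(teacher_dominates(q_star, teacher, offline, [0, 1, 2]))  # False
```

The second call fails because the inequality is strict: in state 2 both actions have the same optimal value, so the teacher's action is not strictly better there.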

Figures (17)

Figure 1: (a) Full expert Mountain Car dataset and reduced dataset with the first 10% of samples, showing the distribution of states (position, velocity) and actions. (b) CQL agent converging to a sub-optimal policy on the reduced dataset, exhibiting high Q-values in unseen states for actions that differ from those in the expert dataset.
Figure 2: Overview of the proposed methodology. (a) Training a teacher policy network with domain knowledge and synthetic data. (b) Updating the offline RL critic network with the teacher network.
Figure 3: Performance of (a) CQL and (b) ExID on all datasets for Mountain Car during online evaluation. (c) State-action pairs used for training the teacher network for Mountain Car.
Figure 4: Q-value difference between CQL and ExID for expert and policy actions on states not present in the buffer for the (a) expert, (b) replay (log scale), and (c) noisy (log scale) datasets. (d) Contribution of $\mathcal{L}_r(\theta)$.
Figure 5: (a) Effect of different $\lambda$ on the performance of ExID on Lunar Lander. (b) Effect of different $k$ on the performance of ExID on Lunar Lander. (c) Performance of ExID with teacher update, without teacher update, and with warm start only on Cart-pole.
...and 12 more figures

Theorems & Definitions (4)

Definition 4.1
Proposition 4.2
Proposition 1.1
proof
