Integrating Domain Knowledge for handling Limited Data in Offline RL
Briti Gangopadhyay, Zhao Wang, Jia-Fong Yeh, Shingo Takamatsu
TL;DR
This work tackles offline reinforcement learning under limited data by introducing ExID, a domain-knowledge-based regularization framework. A teacher policy derived from hierarchical domain knowledge guides the offline critic through a Q-regularizer $\mathcal{L}_r(\theta)$ and a balanced loss $\mathcal{L}(\theta)=\mathcal{L}_{\mathrm{cql}}(\theta)+\lambda\mathbb{E}[Q_s^\theta(s,a_s)-Q_s^\theta(s,a_t)]^2$, combined with a two-phase warm start and uncertainty-based teacher updates. Empirically, ExID yields an average improvement of at least 27% over strong offline RL baselines on discrete OpenAI Gym and MiniGrid datasets when data is scarce. It also generalizes to OOD states covered by $\mathcal{D}$, with $\mathcal{L}_r(\theta)$ becoming more informative as state coverage increases. The approach highlights the practical value of integrating domain knowledge into offline RL to improve robustness. Its effectiveness depends on the quality and coverage of the domain knowledge, and future work includes continuous action spaces and automatic domain knowledge extraction.
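To make the balanced loss above concrete, the following is a minimal Python sketch of the regularization term only, not the authors' implementation. It assumes that $a_s$ is the critic's greedy action, that $a_t$ is the teacher's action for the same state, and that the square is applied per sample before averaging. The function names `q_regularizer` and `balanced_loss`, the default `lam` value, and the plain-list inputs are hypothetical. The CQL term is taken as a precomputed number, and the warm start, uncertainty-based teacher updates, and any selection of which states the regularizer applies to are omitted.

```python
def q_regularizer(q_values, teacher_actions):
    """Mean squared gap between Q(s, a_s) and Q(s, a_t) over a batch.

    q_values: list of per-state lists of Q-values, one entry per discrete action.
    teacher_actions: list of action indices suggested by the teacher (assumed given).
    """
    gaps = []
    for q_row, a_t in zip(q_values, teacher_actions):
        # Assumption: the student's action a_s is the greedy action under Q.
        a_s = max(range(len(q_row)), key=q_row.__getitem__)
        gaps.append((q_row[a_s] - q_row[a_t]) ** 2)
    return sum(gaps) / len(gaps)


def balanced_loss(cql_loss, q_values, teacher_actions, lam=0.5):
    """Balanced loss: precomputed CQL loss plus lam times the Q-regularizer.

    lam=0.5 is an illustrative default, not a value taken from the paper.
    """
    return cql_loss + lam * q_regularizer(q_values, teacher_actions)


# Toy batch of two states with two actions each.
q_batch = [[1.0, 3.0], [2.0, 0.0]]
teacher = [0, 0]
reg = q_regularizer(q_batch, teacher)
total = balanced_loss(1.0, q_batch, teacher, lam=0.5)
```

In the toy batch, the critic disagrees with the teacher only in the first state, so only that state contributes to the penalty. When the greedy action already matches the teacher's action, the term vanishes for that state.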
Abstract
With its ability to learn from static datasets, Offline Reinforcement Learning (RL) is a compelling avenue for real-world applications. However, state-of-the-art offline RL algorithms perform sub-optimally when confronted with limited data confined to specific regions of the state space. This performance degradation is attributed to the inability of offline RL algorithms to learn appropriate actions for rare or unseen observations. This paper proposes a novel domain-knowledge-based regularization technique that adaptively refines the initial domain knowledge, considerably boosting performance on limited data with partially omitted states. The key insight is that the regularization term mitigates erroneous actions for sparse samples and for unobserved states covered by the domain knowledge. Empirical evaluations on standard discrete environment datasets demonstrate a substantial average performance increase of at least 27% over existing offline RL algorithms operating on limited data.
