From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism
Sarthak Wanjari
TL;DR
The paper tackles offline reinforcement learning under distributional shift in high-stakes settings by introducing Geo-IQL, a compute-efficient extension of Implicit Q-Learning (IQL) that injects geometry-based pessimism. It builds a four-stage pipeline (state-action embedding, geometric uncertainty via $k$-NN distances, robust standardisation, and a density-adaptive reward penalty) that precomputes OOD penalties and applies them during training through the target $y_{geo} = r_{geo}(s,a) + \gamma V(s')$. The theoretical analysis establishes a pessimistic bound on the learned Q-values under Lipschitz assumptions, and empirical results show substantial gains on fractured D4RL MuJoCo tasks and improved clinical outcomes on MIMIC-III Sepsis without sacrificing safety. The method adds only $\mathcal{O}(1)$ training overhead, enabling practical deployment on modest hardware while providing improved stability, better policy improvement, and stronger alignment with clinician decisions in real-world decision support.
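The four stages above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the identity embedding (plain concatenation of state and action), the median/MAD standardisation, the clipped linear penalty, and the names `knn_mean_distance`, `robust_z`, `geometric_reward`, `k`, and `lam` are all assumptions made for illustration.

```python
import numpy as np

def knn_mean_distance(emb, k):
    """Mean Euclidean distance from each row to its k nearest other rows.

    Brute-force O(n^2) version for small datasets; requires k < len(emb).
    """
    sq = np.sum(emb ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # a point is not its own neighbour
    nearest = np.sort(d2, axis=1)[:, :k]
    return np.sqrt(nearest).mean(axis=1)

def robust_z(u, eps=1e-8):
    """Median/MAD standardisation (assumed form of 'robust standardisation')."""
    med = np.median(u)
    mad = np.median(np.abs(u - med))
    return (u - med) / (1.4826 * mad + eps)

def geometric_reward(rewards, states, actions, k=5, lam=1.0):
    """Precompute penalised rewards for a whole offline dataset."""
    # Stage 1: embedding. Assumption: plain concatenation of (s, a).
    emb = np.concatenate([states, actions], axis=1)
    # Stage 2: geometric uncertainty from k-NN distances.
    u = knn_mean_distance(emb, k)
    # Stage 3: robust standardisation.
    z = robust_z(u)
    # Stage 4: penalty. Assumption: only points sparser than the median are penalised.
    penalty = lam * np.maximum(z, 0.0)
    return rewards - penalty, penalty
```

In this sketch a transition far from the rest of the dataset receives a large positive `z` and therefore a large penalty, while transitions in dense regions keep their original reward.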
Abstract
Offline Reinforcement Learning (RL) promises the recovery of optimal policies from static datasets, yet it remains susceptible to the overestimation of out-of-distribution (OOD) actions, particularly on fractured and sparse data manifolds. Current solutions necessitate a trade-off between computational efficiency and performance. Methods like CQL offer rigorous conservatism but require tremendous compute power, while efficient expectile-based methods like IQL often fail to correct OOD errors on pathological datasets, collapsing to behavioural cloning. In this work, we propose Geometric Pessimism, a modular, compute-efficient framework that augments standard IQL with a density-based penalty derived from $k$-nearest-neighbour distances in the state-action embedding space. By pre-computing the penalty applied to each state-action pair, our method injects OOD conservatism via reward shaping with $\mathcal{O}(1)$ training overhead. Evaluated on the D4RL MuJoCo benchmark, our method, Geo-IQL, outperforms standard IQL on sensitive and unstable medium-replay tasks by over 18 points, while reducing inter-seed variance by 4x. Furthermore, Geo-IQL does not degrade performance on stable manifolds. Crucially, we validate our algorithm on the MIMIC-III Sepsis critical care dataset. While standard IQL collapses to behavioural cloning, Geo-IQL demonstrates active policy improvement while maintaining safety constraints, achieving 86.4% terminal agreement with clinicians compared to IQL's 75%. Our results suggest that geometric pessimism provides the necessary regularisation to safely overcome local optima in critical, real-world decision systems.
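To make the constant-overhead claim concrete, the following hedged sketch shows how precomputed penalties could enter the value target $y_{geo} = r_{geo}(s,a) + \gamma V(s')$. The function name `shaped_targets`, the subtraction form $r_{geo} = r - \text{penalty}$, and the terminal mask on the bootstrap term are assumptions; the rest of the IQL update is left unchanged and is not shown.

```python
import numpy as np

def shaped_targets(batch_idx, rewards, penalties, next_values, dones, gamma=0.99):
    """Build y_geo = r_geo + gamma * V(s') for one minibatch.

    rewards, penalties, dones: arrays over the whole offline dataset.
    next_values: V(s') for the sampled batch, from the current value network.
    """
    # The penalty is a table lookup, so the extra cost per transition is constant.
    r_geo = rewards[batch_idx] - penalties[batch_idx]
    # Assumption: no bootstrapping past terminal transitions.
    return r_geo + gamma * (1.0 - dones[batch_idx]) * next_values
```

Because the penalties are computed once before training, the only change to the training loop in this sketch is the array lookup on the reward.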
