From Robotics to Sepsis Treatment: Offline RL via Geometric Pessimism

Sarthak Wanjari

TL;DR

The paper tackles offline reinforcement learning under distributional shift in high-stakes settings by introducing Geo-IQL, a compute-efficient extension of Implicit Q-Learning (IQL) that injects geometry-based pessimism. It builds a four-stage pipeline (state-action embedding, geometric uncertainty via $k$-NN distances, robust standardisation, and a density-adaptive reward penalty) that precomputes out-of-distribution (OOD) penalties and applies them during training through the target $y_{geo} = r_{geo}(s,a) + \gamma V(s')$. The theoretical analysis gives a pessimistic bound on the learned Q-values under Lipschitz assumptions, and the empirical results show substantial gains on fractured D4RL MuJoCo tasks and improved clinical outcomes on MIMIC-III Sepsis without sacrificing safety. The method adds only $\mathcal{O}(1)$ training overhead, which makes deployment on modest hardware practical while improving stability, policy improvement, and alignment with clinician decisions in real-world decision support.
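
The four stages can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the concatenation embedding, the median/MAD standardisation, the penalty form $r_{geo} = r - \lambda \max(z, 0)$, and the names `embed`, `knn_mean_distance`, `robust_standardise`, `geometric_rewards`, `k`, and `lam` are all hypothetical choices made for the example.

```python
import math
from statistics import median

def embed(state, action):
    # Stage 1 (assumption): embed a pair by concatenating state and action.
    return tuple(state) + tuple(action)

def knn_mean_distance(index, points, k):
    # Stage 2: mean Euclidean distance from points[index] to its k nearest
    # neighbours among the other dataset points (requires len(points) > k).
    query = points[index]
    dists = sorted(math.dist(query, p) for j, p in enumerate(points) if j != index)
    return sum(dists[:k]) / k

def robust_standardise(values):
    # Stage 3 (assumption): centre on the median, scale by the median
    # absolute deviation (MAD); fall back to 1.0 if the MAD is zero.
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0
    return [(v - med) / mad for v in values]

def geometric_rewards(states, actions, rewards, k=2, lam=0.5):
    # Stage 4 (assumed form): penalise only points that are sparser than
    # typical, i.e. r_geo = r - lam * max(z, 0).
    points = [embed(s, a) for s, a in zip(states, actions)]
    raw = [knn_mean_distance(i, points, k) for i in range(len(points))]
    z = robust_standardise(raw)
    return [r - lam * max(zi, 0.0) for r, zi in zip(rewards, z)]

# Toy dataset: three nearby transitions and one isolated transition.
states = [(0.0,), (0.1,), (0.2,), (5.0,)]
actions = [(0.0,)] * 4
rewards = [1.0] * 4
shaped = geometric_rewards(states, actions, rewards, k=2, lam=0.5)
print(shaped)
```

In this toy run the isolated fourth transition receives a much lower shaped reward than the three clustered ones, which is the qualitative behaviour the pipeline is meant to produce. The brute-force neighbour search is quadratic in the dataset size; a real dataset would need a spatial index or an approximate nearest-neighbour library.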

Abstract

Offline Reinforcement Learning (RL) promises the recovery of optimal policies from static datasets, yet it remains susceptible to the overestimation of out-of-distribution (OOD) actions, particularly in fractured and sparse data manifolds. Current solutions necessitate a trade-off between computational efficiency and performance. Methods like CQL offer rigorous conservatism but require tremendous compute power, while efficient expectile-based methods like IQL often fail to correct OOD errors on pathological datasets, collapsing to behavioural cloning. In this work, we propose Geometric Pessimism, a modular, compute-efficient framework that augments standard IQL with a density-based penalty derived from $k$-nearest-neighbour distances in the state-action embedding space. By pre-computing the penalty applied to each state-action pair, our method injects OOD conservatism via reward shaping with an $\mathcal{O}(1)$ training overhead. Evaluated on the D4RL MuJoCo benchmark, our method, Geo-IQL, outperforms standard IQL on sensitive and unstable medium-replay tasks by over 18 points while reducing inter-seed variance by 4x. Furthermore, Geo-IQL does not degrade performance on stable manifolds. Crucially, we validate our algorithm on the MIMIC-III Sepsis critical care dataset. While standard IQL collapses to behavioural cloning, Geo-IQL demonstrates active policy improvement while maintaining safety constraints, achieving 86.4% terminal agreement with clinicians compared to IQL's 75%. Our results suggest that geometric pessimism provides the necessary regularisation to safely overcome local optima in critical, real-world decision systems.
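
To illustrate why pre-computation keeps the training overhead constant, the sketch below builds the bootstrapped target $r_{geo}(s,a) + \gamma V(s')$ from the TL;DR using shaped rewards that were stored once per transition. It is a hedged example, not the authors' code: the function name `geo_targets`, the `dones` terminal mask, and the list-based batch layout are assumptions, and the value estimates are placeholder numbers rather than network outputs.

```python
def geo_targets(batch_indices, r_geo, next_values, dones, gamma=0.99):
    # r_geo holds one precomputed shaped reward per dataset transition, so
    # the penalty costs a single list lookup per sample at training time.
    targets = []
    for i, v_next, done in zip(batch_indices, next_values, dones):
        # Assumption: terminal transitions do not bootstrap from V(s').
        bootstrap = 0.0 if done else gamma * v_next
        targets.append(r_geo[i] + bootstrap)
    return targets

# Hypothetical precomputed shaped rewards for a two-transition dataset.
r_geo = [1.0, -93.0]
targets = geo_targets(
    batch_indices=[0, 1],
    r_geo=r_geo,
    next_values=[2.0, 3.0],   # placeholder V(s') estimates
    dones=[False, True],
    gamma=0.5,
)
print(targets)
```

No neighbour search happens inside the training loop; all geometric computation is confined to the one-off preprocessing pass over the dataset.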

Paper Structure (37 sections, 1 theorem, 28 equations, 6 figures, 3 tables, 1 algorithm)

This paper contains 37 sections, 1 theorem, 28 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1.7

Under Assumptions 1 and 2, for any $(s,a) \in \mathcal{S} \times \mathcal{A}$, if the penalty parameter satisfies [...], where [...], then [...].

Figures (6)

  • Figure 1: Three-dimensional visualisation of geometry as a proxy for epistemic uncertainty.
  • Figure 2: Visualizing the Adaptive Safety Mechanism. The blue cloud represents the training data manifold. Green point: a query inside the dense region; the algorithm detects near-zero distance to its neighbours and applies no penalty. Yellow point: a query slightly off-manifold; this triggers the adaptive lambda, which applies a mild corrective penalty. Red point: a query far from the data, effectively OOD; the algorithm measures a large mean Euclidean distance to its neighbours and applies a heavy penalty to reject this action entirely.
  • Figure 3: Performance over 1M training steps.
  • Figure 4: Q-improvement of IQL and Geo-IQL
  • Figure 5: Triangle Inequality
  • ...and 1 more figure

Theorems & Definitions (7)

  • Definition 1.1: State-Action Space
  • Definition 1.2: Offline Dataset
  • Definition 1.3: True Q-function
  • Definition 1.4: Learned Q-function
  • Definition 1.5: Distance to Dataset
  • Definition 1.6: Geo-IQL Estimate
  • Theorem 1.7