DO-IQS: Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping with Unknown Gain Functions

Anna Kuchko

TL;DR

DO-IQS tackles inverse optimal stopping with unknown gain functions by coupling offline IQ-learning to a dynamics model and augmenting the state with a cumulative continuation gain $Y_t$, thereby handling non-Markovian rewards and boundary conditions in a data-sparse, offline setting. It introduces oversampling via CS-SMOTE to address stopping-data sparsity and develops a bi-level optimization loop that updates both the Q-function and an approximate environment model, enabling robust estimation of the stopping region $D^\star$ without environment queries. The method is evaluated on synthetic 2D Brownian motion and real critical-event datasets, showing improved stopping-region accuracy and balanced-accuracy metrics compared to baselines, with the DO-IQS-LB variant performing particularly well in sparse-data regimes. These contributions advance safe, offline inference of optimal stopping behavior in high-dimensional settings where the stopping surface is critical for risk-sensitive decisions.
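Balanced accuracy matters here because stop decisions are rare: each expert trajectory contains many "continue" steps and a single "stop". The sketch below uses the standard definition (the mean of per-class recall) on binary continue/stop labels. The paper's exact evaluation protocol is not given in this summary, so the function name and label encoding (0 = continue, 1 = stop) are assumptions made for illustration.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels (0 = continue, 1 = stop).

    Classes absent from y_true are skipped so the mean stays well defined.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for label in (0, 1):
        mask = y_true == label
        if mask.any():
            # Recall for this class: fraction of its samples predicted correctly.
            recalls.append(np.mean(y_pred[mask] == label))
    return float(np.mean(recalls))

# A policy that never stops looks accurate on imbalanced data,
# but its balanced accuracy exposes the failure on the stop class.
y_true = np.array([0] * 98 + [1] * 2)
never_stop = np.zeros_like(y_true)
plain_accuracy = float(np.mean(never_stop == y_true))
bal_accuracy = balanced_accuracy(y_true, never_stop)
```

With 98 continue steps and 2 stop steps, the never-stop policy has plain accuracy 0.98 but balanced accuracy 0.5, which is why a class-balanced metric is the more informative choice for sparse stopping data.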

Abstract

We consider the Inverse Optimal Stopping (IOS) problem, in which the goal is to recover the optimal stopping region from stopped expert trajectories by approximating the continuation and stopping gain functions. The uniqueness of the stopping region allows the use of IOS in real-world applications with safety concerns. Although current state-of-the-art inverse reinforcement learning methods recover both a Q-function and the corresponding optimal policy, they fail to account for specific challenges posed by optimal stopping problems. These include data sparsity near the stopping region, the non-Markovian nature of the continuation gain, the proper treatment of boundary conditions, the need for a stable offline approach for risk-sensitive applications, and the lack of a quality evaluation metric. The proposed Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping (DO-IQS) addresses these challenges: it incorporates temporal information by approximating the cumulative continuation gain together with the world dynamics and the Q-function, without querying the environment. In addition, a confidence-based oversampling approach is proposed to treat the data sparsity problem. We demonstrate the performance of our models on real and artificial data, including an optimal intervention for the critical events problem.
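To make the idea of carrying temporal information in the state concrete, the following minimal sketch appends a running cumulative continuation gain to each point of a stopped trajectory. It is an illustration under stated assumptions, not the paper's implementation: the disc-exit stopping rule, the function names, and the quadratic gain are all hypothetical, and in DO-IQS the gain functions are unknown and must be learned rather than supplied.

```python
import numpy as np

def simulate_stopped_bm(rng, radius=1.0, dt=0.01, max_steps=1000):
    """2D Brownian path from the origin, stopped on first exit from a disc.

    The disc-exit rule is a hypothetical stand-in for an expert's stopping rule.
    """
    steps = rng.normal(scale=np.sqrt(dt), size=(max_steps, 2))
    path = np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])
    outside = np.linalg.norm(path, axis=1) >= radius
    # Index of the first exit, or the final index if the path never exits.
    stop = int(np.argmax(outside)) if outside.any() else max_steps
    return path[: stop + 1]

def augment_with_cumulative_gain(path, gain, dt=0.01):
    """Append Y_t = sum over s < t of gain(x_s) * dt as an extra state coordinate."""
    g = np.array([gain(x) for x in path], dtype=float)
    # Y_0 = 0; each later Y_t accumulates the gain collected before time t.
    y = np.concatenate([[0.0], np.cumsum(g[:-1]) * dt])
    return np.column_stack([path, y])

rng = np.random.default_rng(0)
path = simulate_stopped_bm(rng)
# Hypothetical continuation gain: a running cost that grows with distance from the origin.
augmented = augment_with_cumulative_gain(path, gain=lambda x: -float(x @ x))
```

Each row of `augmented` is `(x_1, x_2, Y_t)`. Because the accumulated gain is now part of the state, a Q-function defined on the augmented state can depend on the history of the trajectory while the learning problem stays Markovian in form.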
