DO-IQS: Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping with Unknown Gain Functions

Anna Kuchko

TL;DR

DO-IQS tackles inverse optimal stopping with unknown gain functions by coupling offline IQ-learning to a dynamics model and augmenting the state with a cumulative continuation gain $Y_t$, thereby handling non-Markovian rewards and boundary conditions in a data-sparse, offline setting. It introduces oversampling via CS-SMOTE to address stopping-data sparsity and develops a bi-level optimization loop that updates both the Q-function and an approximate environment model, enabling robust estimation of the stopping region $D^*$ without environment queries. The method is evaluated on synthetic 2D Brownian motion and real critical-event datasets, showing improved stopping-region accuracy and balanced-accuracy metrics compared to baselines, with the DO-IQS-LB variant performing particularly well in sparse-data regimes. These contributions advance safe, offline inference of optimal stopping behavior in high-dimensional settings where the stopping surface is critical for risk-sensitive decisions.

Abstract

We consider the Inverse Optimal Stopping (IOS) problem where, based on stopped expert trajectories, one aims to recover the optimal stopping region by approximating the continuation and stopping gain functions. The uniqueness of the stopping region allows the use of IOS in real-world applications with safety concerns. Although current state-of-the-art inverse reinforcement learning methods recover both a Q-function and the corresponding optimal policy, they fail to account for specific challenges posed by optimal stopping problems. These include data sparsity near the stopping region, the non-Markovian nature of the continuation gain, the proper treatment of boundary conditions, the need for a stable offline approach for risk-sensitive applications, and the lack of a quality evaluation metric. These challenges are addressed with the proposed Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping (DO-IQS), which incorporates temporal information by approximating the cumulative continuation gain together with the world dynamics and the Q-function, without querying the environment. In addition, a confidence-based oversampling approach is proposed to treat the data sparsity problem. We demonstrate the performance of our models on real and artificial data, including an optimal intervention for the critical events problem.

Paper Structure

This paper contains 29 sections, 1 theorem, 38 equations, 18 figures, 3 tables, 3 algorithms.

Key Result

Theorem 1

Let $X$ be an adapted right-continuous stochastic process defined on a filtered probability space. If $D$ is a Borel-measurable stopping set, then the first hitting time $\tau_D = \inf\{t \geq 0 : X_t \in D\}$ is a stopping time.
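
Theorem 1 concerns the time at which the process first enters the stopping set $D$. As a rough illustration only, the sketch below computes a discrete-time analogue of that first entry time on a simulated 2D Brownian path, using a radial stopping set. The function names, the Euler discretisation, the step size and the radius are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def simulate_brownian_2d(n_steps, dt=0.01, seed=0):
    """Simulate a discretised 2D Brownian path started at the origin.

    Hypothetical helper: the step size and seed are illustrative choices.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    sd = math.sqrt(dt)  # standard deviation of each Gaussian increment
    for _ in range(n_steps):
        x += rng.gauss(0.0, sd)
        y += rng.gauss(0.0, sd)
        path.append((x, y))
    return path


def radial_stopping_set(radius):
    """Indicator of an assumed stopping set D = {x : |x| >= radius}."""
    return lambda p: math.hypot(p[0], p[1]) >= radius


def first_hitting_time(path, in_stopping_set):
    """Return the first index t with path[t] in D, or None if D is never hit.

    The decision at index t only inspects path[0..t], which mirrors the
    defining property of a stopping time.
    """
    for t, x in enumerate(path):
        if in_stopping_set(x):
            return t
    return None


if __name__ == "__main__":
    path = simulate_brownian_2d(n_steps=5000)
    tau = first_hitting_time(path, radial_stopping_set(1.0))
    print("first hitting index:", tau)
```

Because the loop returns as soon as the set is entered, the result never depends on later values of the path, which is the discrete counterpart of the measurability statement in the theorem.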

Figures (18)

  • Figure 1: POMDP structure of the OS problem with cumulative continuation gain (left) and DO-IQS model structure (right).
  • Figure 2: Left: OS with sparse observations. Red and green dots represent expert stopping and continuation decisions. Right: True Negatives (TN), True Positives (TP), False Negatives (FN) and False Positives (FP).
  • Figure 3: bmG
  • Figure 4: Top-to-bottom: a standard MDP with zero reward in absorbing states; MDP with non-zero reward in absorbing states; SMDP with stopping and continuation gains and zero-reward absorbing states.
  • Figure 5: ANN structure for the Model-based IQS
  • ...and 13 more figures
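
The confusion categories named in the Figure 2 caption (TP, TN, FP, FN) are the ingredients of a balanced-accuracy score. A minimal sketch follows, assuming the standard definition (the mean of the true-positive and true-negative rates) and assuming that "positive" means a stopping decision. The paper's exact metric may differ, and the function name and the 0/1 label encoding are hypothetical.

```python
def balanced_accuracy(expert_stops, predicted_stops):
    """Balanced accuracy of predicted stop/continue decisions.

    Both arguments are equal-length sequences of 0/1 labels, where 1 is
    assumed to mean "stop" and 0 "continue" (an illustrative encoding).
    """
    tp = tn = fp = fn = 0
    for expert, predicted in zip(expert_stops, predicted_stops):
        if expert and predicted:
            tp += 1
        elif expert and not predicted:
            fn += 1
        elif not expert and predicted:
            fp += 1
        else:
            tn += 1
    # Rates default to 0.0 when a class is absent, to avoid dividing by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (tpr + tnr)


if __name__ == "__main__":
    expert = [1, 1, 0, 0, 0, 0]
    predicted = [1, 0, 0, 0, 0, 1]
    print(balanced_accuracy(expert, predicted))
```

Averaging the two rates keeps the score informative when stopping decisions are rare, since always predicting "continue" then yields 0.5 instead of a misleadingly high plain accuracy.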

Theorems & Definitions (9)

  • proof
  • Theorem 1: Debut Theorem
  • proof
  • proof
  • Example C.1: CP1
  • Example C.2: CP2
  • Example C.3: CP3
  • Example C.4: Radial stopping
  • Example C.5: STAR-stopping