Epistemic Exploration for Generalizable Planning and Learning in Non-Stationary Settings

Rushang Karia, Pulkit Verma, Alberto Speranzon, Siddharth Srivastava

TL;DR

The paper tackles continual planning and model learning in non-stationary stochastic environments with unknown dynamics by introducing CLaP, an adaptive loop that interleaves epistemic exploration with planning on learned relational PPDDL models. CLaP detects inconsistencies between observed transitions and the current model, guides targeted exploration through FOND-based queries, and updates only the inconsistent parts of the model to maintain sample efficiency. Theoretical results on variational distance show locally convergent learning within stationary epochs, and empirical results across four IPPC domains demonstrate strong transfer and near-Oracle performance with significantly reduced sample complexity compared to baselines. This approach enables robust deployment of planning systems in dynamic real-world settings by combining active learning, epistemic planning, and symbolic relational models for generalization and adaptation.

Abstract

This paper introduces a new approach for continual planning and model learning in relational, non-stationary stochastic environments. Such capabilities are essential for the deployment of sequential decision-making systems in the uncertain and constantly evolving real world. Working in such practical settings with unknown (and non-stationary) transition systems and changing tasks, the proposed framework models gaps in the agent's current state of knowledge and uses them to conduct focused, investigative explorations. Data collected using these explorations is used for learning generalizable probabilistic models for solving the current task despite continual changes in the environment dynamics. Empirical evaluations on several non-stationary benchmark domains show that this approach significantly outperforms planning and RL baselines in terms of sample complexity. Theoretical results show that the system exhibits desirable convergence properties when stationarity holds.
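
The loop summarized above (detect a transition the current model cannot explain, explore around it, and update only the affected part of the model) can be illustrated with a minimal sketch. This is not the paper's CLaP implementation: it swaps the relational PPDDL model for a simple tabular count-based model, replaces FOND-based query planning with repeated sampling, and assumes that "consistent" means the model assigns the observed transition nonzero probability. The names `TabularModel`, `observe`, and `sample_next` are hypothetical.

```python
from collections import Counter, defaultdict


class TabularModel:
    """Hypothetical tabular stand-in for a learned model; NOT the paper's PPDDL learner."""

    def __init__(self):
        # (state, action) -> Counter of observed next states
        self.counts = defaultdict(Counter)

    def prob(self, s, a, s_next):
        """Empirical estimate of P(s_next | s, a); 0.0 if (s, a) is unseen."""
        outcomes = self.counts.get((s, a))
        if not outcomes:
            return 0.0
        return outcomes[s_next] / sum(outcomes.values())

    def is_consistent(self, s, a, s_next):
        """Assumed notion of consistency: the transition has nonzero model probability."""
        return self.prob(s, a, s_next) > 0.0

    def relearn(self, s, a, next_states):
        """Replace the estimate for one (s, a) pair, leaving all others untouched."""
        self.counts[(s, a)] = Counter(next_states)


def observe(model, sample_next, s, a, s_next, n_samples=20):
    """Check one observed transition; on inconsistency, explore and repair locally.

    sample_next(s, a) is a hypothetical simulator call returning a next state.
    Returns True if the model was updated, False otherwise.
    """
    if model.is_consistent(s, a, s_next):
        return False
    # Targeted exploration: gather fresh data only for the inconsistent (s, a).
    samples = [s_next] + [sample_next(s, a) for _ in range(n_samples)]
    model.relearn(s, a, samples)
    return True
```

Restricting the update to the inconsistent `(s, a)` entry mirrors the sample-efficiency argument in the summary: data gathered before a change in dynamics remains usable for the parts of the model that the change did not affect.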

Paper Structure (11 sections, 1 theorem, 3 equations, 1 figure, 1 algorithm)

This paper contains 11 sections, 1 theorem, 3 equations, 1 figure, 1 algorithm.

Key Result

Theorem 1

Let $M$ be an RMDP with a series of transition system changes $\delta_1, \ldots, \delta_n$ at timesteps $t_1, \ldots, t_n$, implemented using a simulator $\Delta$. Then, during each stationary epoch between $t_i$ and $t_{i+1}$, Alg. 1 performs locally convergent model learning.

Figures (1)

  • Figure 1: Results (best viewed in color) from our experiments, averaged across 10 runs with one standard deviation (shaded). (a) plots the learning curves of the methods, (b) plots the avg. reward obtained by greedily running the policy computed 10 times (for clarity, the Oracle's avg. reward is annotated with $\times$ periodically), (c) plots the total steps needed to achieve steady-state performance (defined in the paper's results analysis subsection) equal to the Oracle's. Higher values are better for (a) and (b); lower for (c). Vertical squiggly lines denote the step where a new task $M_{i+1}$ and transition system $\delta_{i+1}$ were loaded ($M_i \neq M_{i+1}$ and $\delta_i \neq \delta_{i+1}$).

Theorems & Definitions (9)

  • Definition 2.1: $\mathcal{M}$-Consistent Transition
  • Definition 2.2: Policy Trace
  • Definition 2.3: $p$-distinguishing policies
  • Definition 3.1: RMDP equivalence
  • Definition 3.2: Continual Planning under Non-Stationarity
  • Definition 3.3: Variational Distance (VD)
  • Definition 3.4: Locally Convergent Model Learning
  • Theorem 1
  • Proof (Sketch) of Theorem 1
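
Definition 3.3 names variational distance (VD), but its formula is not reproduced on this page. As a hedged illustration, the sketch below assumes the sample-based form often used in model-learning work: the mean absolute difference between the probabilities two models assign to each transition in a dataset. The function name and the callable-model interface are hypothetical, and the paper's exact definition may differ.

```python
def variational_distance(p_model, q_model, transitions):
    """Mean absolute difference in transition probability over a dataset.

    p_model and q_model are hypothetical callables (s, a, s_next) -> probability.
    transitions is a non-empty sequence of (s, a, s_next) tuples.
    Assumed form: VD = (1 / |D|) * sum over D of |P(s'|s,a) - Q(s'|s,a)|.
    """
    if not transitions:
        raise ValueError("transitions must be non-empty")
    total = sum(
        abs(p_model(s, a, s_next) - q_model(s, a, s_next))
        for s, a, s_next in transitions
    )
    return total / len(transitions)


# Usage: two toy models that disagree on the outcome probabilities of one action.
learned = lambda s, a, s_next: {"s1": 0.8, "s2": 0.2}[s_next]
true = lambda s, a, s_next: {"s1": 0.5, "s2": 0.5}[s_next]
data = [("s0", "go", "s1"), ("s0", "go", "s2")]
vd = variational_distance(learned, true, data)
```

Under this assumed form, a VD of zero on the dataset means the two models agree on every sampled transition, which is the sense in which a shrinking VD within a stationary epoch would indicate convergence.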