$κ$-Explorer: A Unified Framework for Active Model Estimation in MDPs

Xihe Gu; Urbashi Mitra; Tara Javidi

TL;DR

The paper introduces a parameterized family of decomposable, concave objective functions that explicitly incorporate both intrinsic estimation complexity and extrinsic visitation frequency, and proposes $\kappa$-Explorer, an active exploration algorithm that performs Frank-Wolfe-style optimization over state-action occupancy measures.

Abstract

In tabular Markov decision processes (MDPs) with perfect state observability, each trajectory provides active samples from the transition distributions conditioned on state-action pairs. Consequently, accurate model estimation depends on how the exploration policy allocates visitation frequencies in accordance with the intrinsic complexity of each transition distribution. Building on recent work on coverage-based exploration, we introduce a parameterized family of decomposable and concave objective functions $U_κ$ that explicitly incorporate both intrinsic estimation complexity and extrinsic visitation frequency. Moreover, the curvature $κ$ provides a unified treatment of various global objectives, such as the average-case and worst-case estimation error objectives. Using the closed-form characterization of the gradient of $U_κ$, we propose $κ$-Explorer, an active exploration algorithm that performs Frank-Wolfe-style optimization over state-action occupancy measures. The diminishing-returns structure of $U_κ$ naturally prioritizes underexplored and high-variance transitions, while preserving smoothness properties that enable efficient optimization. We establish tight regret guarantees for $κ$-Explorer and further introduce a fully online and computationally efficient surrogate algorithm for practical use. Experiments on benchmark MDPs demonstrate that $κ$-Explorer provides superior performance compared to existing exploration strategies.
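The Frank-Wolfe step over state-action occupancy measures mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the definition of $U_κ$ is not given in this summary, so the code substitutes a stand-in concave, decomposable utility $\sum_{s,a} w_{s,a} \log(d_{s,a} + \epsilon)$. The discounted setting, the known transition kernel `P`, and the names `weights` (a hypothetical per-pair complexity weight), `eps`, and `gamma` are all assumptions made for illustration.

```python
import numpy as np

def occupancy(P, policy, mu, gamma):
    """Normalized discounted state-action occupancy of a deterministic policy."""
    S, A, _ = P.shape
    P_pi = P[np.arange(S), policy]  # (S, S) transition matrix under the policy
    # State occupancy solves d = (1 - gamma) * mu + gamma * P_pi^T d.
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu)
    d = np.zeros((S, A))
    d[np.arange(S), policy] = d_s
    return d

def greedy_policy(P, r, gamma, iters=200):
    """Value iteration for reward r of shape (S, A); returns a greedy policy."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = r + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def utility(d, weights, eps):
    """Stand-in concave objective (assumption, not the paper's U_kappa)."""
    return float(np.sum(weights * np.log(d + eps)))

def frank_wolfe(P, mu, weights, gamma=0.9, steps=300, eps=0.1):
    """Frank-Wolfe over the occupancy polytope of a known tabular MDP."""
    S = P.shape[0]
    d = occupancy(P, np.zeros(S, dtype=int), mu, gamma)  # start at a vertex
    for t in range(steps):
        grad = weights / (d + eps)          # gradient of the stand-in utility
        pi = greedy_policy(P, grad, gamma)  # linear maximization = solve an MDP
        d_pi = occupancy(P, pi, mu, gamma)
        eta = 2.0 / (t + 2.0)               # standard Frank-Wolfe step size
        d = (1 - eta) * d + eta * d_pi      # stay inside the polytope
    return d
```

The linear maximization step reduces to planning with the gradient as a reward, which is why each iteration only needs an MDP solver. Because the gradient of a diminishing-returns utility is largest where the occupancy is smallest, the iterates shift mass toward rarely visited pairs, in line with the prioritization described in the abstract.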

Paper Structure

This paper contains 29 sections, 4 theorems, 78 equations, 1 table, and 2 algorithms.

Introduction
Problem Formulation
Motivating Scenario
Empirical Distribution
Mean Square Estimation Error
Global Objective Function
Proposed Method: $\kappa$-Explorer
Smooth Objective Function $U_\kappa$ with Diminishing Returns
$\kappa$-Explorer Algorithm
Regret Analysis
Proof sketch
Proposed Fully Online and Efficient Heuristic
Implementation Variants of $\kappa$-Explorer
Experiments
Discretized MDP Environments
...and 14 more sections

Key Result

Lemma 1

Consider the empirical estimator (eq:p_hat_def) and the squared $\ell_2$ estimation error criterion. For any given state-action pair $(s, a)$ under an ergodic policy $\pi$ and sufficiently large $n$,

Theorems & Definitions (8)

lemma 1
theorem 4
proof
lemma 2
proof
lemma 3
proof
proof
