Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets

Alexander Levine, Peter Stone, Amy Zhang

TL;DR

The paper addresses learning compact, controllable representations from high-dimensional, noisy observations in Ex-BMDPs using action-free offline data. It introduces CRAFT, a non-recursive, comparison-based algorithm that leverages diverse offline datasets from two agents with different policies to cluster observation-pairs by latent-state dynamics, yielding provable sample-efficiency guarantees. Theoretical results establish a polynomial sample complexity bound and per-timestep encoders with high accuracy, demonstrated in a toy simulation where CRAFT outperforms simple baselines. This work provides a principled path toward practical representation learning from offline video data for control tasks in Ex-BMDP-like environments, with clear avenues for extending to more agents and nondeterministic dynamics.

Abstract

While sequential decision-making environments often involve high-dimensional observations, not all features of these observations are relevant for control. In particular, the observation space may capture factors of the environment which are not controllable by the agent, but which add complexity to the observation space. The need to ignore these "noise" features in order to operate in a tractably-small state space poses a challenge for efficient policy learning. Due to the abundance of video data available in many such environments, task-independent representation learning from action-free offline data offers an attractive solution. However, recent work has highlighted theoretical limitations in action-free learning under the Exogenous Block MDP (Ex-BMDP) model, where temporally-correlated noise features are present in the observations. To address these limitations, we identify a realistic setting where representation learning in Ex-BMDPs becomes tractable: when action-free video data from multiple agents with differing policies are available. Concretely, this paper introduces CRAFT (Comparison-based Representations from Action-Free Trajectories), a sample-efficient algorithm leveraging differences in controllable feature dynamics across agents to learn representations. We provide theoretical guarantees for CRAFT's performance and demonstrate its feasibility on a toy example, offering a foundation for practical methods in similar settings.
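The signal the abstract describes, that controllable features evolve differently under different policies while exogenous noise does not, can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the CRAFT algorithm itself. The latent sizes, the two policies (`agent_a`, `agent_b`), the sticky noise chain, and every function name are hypothetical choices made for illustration. Unlike the paper's setting, where latent factors are hidden inside high-dimensional observations, the two factors here are directly visible.

```python
import random
from collections import Counter

N_ENDO, N_EXO = 4, 3  # hypothetical sizes of the controllable and noise latents


def rollout(policy, horizon, rng):
    """Generate one action-free trajectory of (controllable, exogenous) pairs."""
    s, e = 0, rng.randrange(N_EXO)
    traj = [(s, e)]
    for _ in range(horizon):
        a = policy(rng)              # the action drives the step but is never recorded
        s = (s + a) % N_ENDO         # deterministic controllable dynamics (assumed)
        if rng.random() < 0.2:       # temporally correlated noise, independent of the action
            e = rng.randrange(N_EXO)
        traj.append((s, e))
    return traj


def agent_a(rng):
    """Hypothetical policy A: always advance the controllable state."""
    return 1


def agent_b(rng):
    """Hypothetical policy B: advance half of the time, otherwise stay."""
    return 1 if rng.random() < 0.5 else 0


def transition_freqs(trajs, idx):
    """Empirical distribution over (value, next value) pairs of factor `idx`."""
    counts = Counter()
    for traj in trajs:
        for x, y in zip(traj, traj[1:]):
            counts[(x[idx], y[idx])] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def compare_agents(n_traj=500, horizon=20, seed=0):
    """Return (TV on controllable transitions, TV on exogenous transitions)."""
    rng_a, rng_b = random.Random(seed), random.Random(seed + 1)
    data_a = [rollout(agent_a, horizon, rng_a) for _ in range(n_traj)]
    data_b = [rollout(agent_b, horizon, rng_b) for _ in range(n_traj)]
    tv_endo = tv_distance(transition_freqs(data_a, 0), transition_freqs(data_b, 0))
    tv_exo = tv_distance(transition_freqs(data_a, 1), transition_freqs(data_b, 1))
    return tv_endo, tv_exo


if __name__ == "__main__":
    tv_endo, tv_exo = compare_agents()
    print(f"controllable factor TV: {tv_endo:.3f}")
    print(f"exogenous factor TV:    {tv_exo:.3f}")
```

In this toy setup the transition statistics of the controllable factor differ sharply between the two datasets, while those of the exogenous factor agree up to sampling noise. That contrast is the kind of cross-dataset difference a comparison-based method could exploit.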

Paper Structure

This paper contains 22 sections, 9 theorems, 128 equations, 3 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.1

Assume that CRAFT (Algorithm alg:main_alg in the Appendix) is given datasets $\tau_A$ and $\tau_B$ such that the assumptions given in Equations eq:noise_free_policy_property, eq:pair_coverage, eq:alpha_seperation, and eq:eta_coverage all hold. Then there exists an where $\mathcal{O}^*(f(x)) := \mathcal{O}(f(x) \log^k(f(x)))$, such that for any given $\delta, \epsilon_0 \geq 0$, if $\forall s_{h}^*$ ...

Figures (3)

  • Figure 1: Dynamics and composition of the two-step Ex-BMDP example in Section sec:draft_two_step.
  • Figure 2: Illustration of recursive use of the "DRAFT" algorithm.
  • Figure 3: Schematic of the CRAFT algorithm. See text of Section sec:craft_description.

Theorems & Definitions (17)

  • Theorem 3.1
  • Lemma C.1
  • proof
  • Corollary C.2
  • proof
  • Proposition C.3
  • proof
  • Lemma C.4
  • proof
  • Lemma C.5
  • ...and 7 more