Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching

Kai Yan, Alexander G. Schwing, Yu-xiong Wang

TL;DR

This work tackles offline Learning from Observations (LfO) by introducing PW-DICE, a novel method that minimizes the primal Wasserstein distance between the learner’s and expert state occupancies using a contrastively learned distance metric. By integrating KL-based pessimistic regularizers, PW-DICE yields a single-level convex optimization whose dual variables enable weighted behavior cloning to recover the policy, and it recovers SMODICE as a special case. Empirically, PW-DICE outperforms state-of-the-art DICE-based methods and other Wasserstein approaches on tabular and MuJoCo benchmarks, demonstrating the importance of the distance metric and robustness to distorted dynamics. The approach unifies $f$-divergence and Wasserstein minimization within a single framework and provides practical improvements for offline LfO tasks with limited expert data and diverse environments.
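The weighted behavior cloning step mentioned above can be illustrated in the tabular case, where it has a closed form: maximizing $\sum_i w_i \log \pi(a_i \mid s_i)$ over categorical policies gives $\pi(a \mid s)$ proportional to the total weight on the pair $(s, a)$. The sketch below is a minimal illustration under that assumption. The function name `weighted_bc_tabular` and the toy weights are hypothetical; in PW-DICE the weights would come from the solved optimization, which is not reproduced here.

```python
from collections import defaultdict


def weighted_bc_tabular(transitions):
    """Fit a tabular policy by weighted behavior cloning.

    transitions: iterable of (state, action, weight) with weight >= 0.
    The weighted maximum-likelihood categorical policy is
    pi(a | s) = (total weight on (s, a)) / (total weight on s).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for state, action, weight in transitions:
        if weight < 0:
            raise ValueError("weights must be non-negative")
        totals[state][action] += weight

    policy = {}
    for state, action_weights in totals.items():
        norm = sum(action_weights.values())
        if norm == 0.0:
            # No usable signal for this state: fall back to uniform
            # over the actions that appear in the data.
            n = len(action_weights)
            policy[state] = {a: 1.0 / n for a in action_weights}
        else:
            policy[state] = {a: w / norm for a, w in action_weights.items()}
    return policy


# Hypothetical toy data: (state, action, weight). The weights are made up
# for illustration and are NOT produced by PW-DICE.
data = [
    ("s0", "left", 0.1),
    ("s0", "right", 0.9),
    ("s0", "right", 1.0),
    ("s1", "left", 2.0),
    ("s1", "right", 0.0),
]
policy = weighted_bc_tabular(data)
```

With these toy weights, state-action pairs that receive a small weight contribute little to the recovered policy, which matches the intuition that low-weight transitions are effectively ignored. For continuous states and actions the same objective would be optimized by gradient descent on a parametric policy rather than by counting.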

Abstract

In real-world scenarios, arbitrary interactions with the environment can often be costly, and actions of expert demonstrations are not always available. To reduce the need for both, offline Learning from Observations (LfO) is extensively studied: the agent learns to solve a task given only expert states and task-agnostic non-expert state-action pairs. The state-of-the-art DIstribution Correction Estimation (DICE) methods, as exemplified by SMODICE, minimize the state occupancy divergence between the learner's and empirical expert policies. However, such methods are limited to either $f$-divergences (KL and $\chi^2$) or Wasserstein distance with Rubinstein duality, the latter of which constrains the underlying distance metric crucial to the performance of Wasserstein-based solutions. To enable more flexible distance metrics, we propose Primal Wasserstein DICE (PW-DICE). It minimizes the primal Wasserstein distance between the learner and expert state occupancies and leverages a contrastively learned distance metric. Theoretically, our framework is a generalization of SMODICE, and is the first work that unifies $f$-divergence and Wasserstein minimization. Empirically, we find that PW-DICE improves upon several state-of-the-art methods. The code is available at https://github.com/KaiYan289/PW-DICE.
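
For reference, the primal and dual forms contrasted in the abstract can be written out for two discrete state distributions. The notation here ($p$, $q$, $c$, $\Pi$, $f$) is generic optimal-transport notation chosen for illustration and may differ from the paper's own. The primal (Kantorovich) form with a cost function $c$ is

$$
W_c(p, q) = \min_{\Pi \geq 0} \sum_{s, s'} \Pi(s, s')\, c(s, s') \quad \text{s.t.} \quad \sum_{s'} \Pi(s, s') = p(s), \;\; \sum_{s} \Pi(s, s') = q(s'),
$$

while the Kantorovich-Rubinstein dual of the 1-Wasserstein distance is

$$
W_1(p, q) = \sup_{\|f\|_{L} \leq 1} \; \mathbb{E}_{s \sim p}[f(s)] - \mathbb{E}_{s \sim q}[f(s)].
$$

In the primal form, $c$ enters only as a coefficient of the linear objective, so any cost (for example, a learned one) can be plugged in. In the dual form, the metric is hidden inside the 1-Lipschitz constraint on $f$, which in practice is typically enforced with respect to a fixed metric such as the Euclidean distance.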

Paper Structure (33 sections, 4 theorems, 34 equations, 17 figures, 2 tables, 1 algorithm)

Key Result

Lemma 1

For any MDP and feasible expert policy $\pi^E$, the inequality constraints $\Pi\geq 0$, $d^\pi_{sa}\geq 0$ in Eq. (eq:primal_main) are equivalent to the simplex constraints $\Pi\in\Delta$, $d^\pi_{sa}\in\Delta$.

Figures (17)

  • Figure 1: An illustration of our method, PW-DICE. a) Problem setting: different trajectories are illustrated by different styles of arrows. b) PW-DICE minimizes regularized 1-Wasserstein distance between the learner's state occupancy $d^\pi_s(s)$ and the expert state occupancy $d^E_s(s)$. The underlying distance function is contrastively learned to represent the reachability between the states. c) With the matching result, weights are calculated for downstream weighted Behavior Cloning (BC) to retrieve the policy. High transparency indicates a small weight for the state and its corresponding action.
  • Figure 2: Performance comparison between the default (normalized cosine) distance metric and the Euclidean distance metric using OTR [luo2023otr] (first column) and SMODICE [ma2022smodice] (second and third columns). The result shows that the underlying distance metric is crucial for the performance of Wasserstein-based methods.
  • Figure 3: The regret (reward gap between learner and expert) of each method on a tabular environment. We observe our method to work the best, regardless of the presence of a regularizer; the regularizer is more important in continuous MDPs.
  • Figure 4: Performance comparison on the MuJoCo testbed. SMODICE-KL and SMODICE-CHI stand for variants of SMODICE using different $f$-divergences (KL or $\chi^2$). Our method generally works the best (i.e., has the highest normalized reward) among all baselines.
  • Figure 5: An illustration of successful (consistent with the OTR paper) and failing reward assignment in OTR [luo2023otr]. OTR performs Wasserstein matching between uniform distributions over the states of each trajectory in the task-agnostic dataset and the expert dataset, instead of between policy distributions. The reward is calculated from the matching result. Such a solution may fail to differentiate good and bad trajectories by giving them similar rewards, as shown in the failure case b).
  • ...and 12 more figures

Theorems & Definitions (8)

  • Lemma 1
  • Theorem 3.1
  • Lemma 1
  • proof
  • Claim 1
  • Corollary 3.2
  • proof
  • proof