Offline Diversity Maximization Under Imitation Constraints

Marin Vlastelica; Jin Cheng; Georg Martius; Pavel Kolev

TL;DR

The paper tackles unsupervised skill discovery in the offline setting by formulating a constrained mutual-information objective: maximize $\mathcal{I}(S;Z)$ to obtain diverse skill-conditioned policies, subject to the KL-divergence constraint $\mathrm{D}_\mathrm{KL}(d_z(S)\|d_E(S))\le \epsilon$, which ensures that each skill imitates state-only expert demonstrations. It introduces Diverse Offline Imitation (DOI), a three-phase offline algorithm that uses Fenchel duality to connect dual value functions with primal state-action occupancies. Starting from offline importance ratios computed via SMODICE, DOI trains skill policies, a skill discriminator $q(z|s)$, and Lagrange multipliers that regulate the diversity-imitation trade-off. The method is evaluated on the D4RL offline benchmark and on a dataset from the 12-DoF Solo12 quadruped robot, with additional sim-to-real experiments in which policies trained in simulation transfer to the real robot. Two key findings are reported. First, a larger $\epsilon$ yields more diverse skills, as measured by distances between the skill importance ratios $\eta_z(s,a)$ and between successor features, at the cost of some task performance. Second, offline diversity agrees with online diversity metrics. The work thus offers a principled and tractable framework for offline skill discovery, with an explicit diversity-imitation trade-off controlled by $\epsilon$.

Abstract

There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot for which the policies trained in simulation transfer well to the real robotic system.
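
The constrained problem described in the TL;DR can be written out explicitly. The block below is a sketch rather than the paper's exact formulation: the first display restates the objective and constraint given above, the second is the standard variational lower bound on mutual information with a discriminator $q(z|s)$, and the third is a generic Lagrangian relaxation. The skill prior $p(z)$ and the per-skill multipliers $\mu_z \ge 0$ are notational assumptions made here for illustration.

```latex
% Constrained diversity maximization (objective and constraint as stated in the TL;DR)
\begin{aligned}
\max_{\{\pi_z\}} \quad & \mathcal{I}(S;Z) \\
\text{s.t.} \quad & \mathrm{D}_\mathrm{KL}\big(d_z(S)\,\|\,d_E(S)\big) \le \epsilon
\quad \text{for every skill } z .
\end{aligned}

% Standard variational lower bound, valid for any discriminator q(z|s)
\mathcal{I}(S;Z) \;=\; H(Z) - H(Z \mid S)
\;\ge\; H(Z) + \mathbb{E}_{z \sim p(z),\, s \sim d_z}\big[\log q(z \mid s)\big] .

% Generic Lagrangian relaxation with assumed multipliers \mu_z \ge 0
\max_{\{\pi_z\},\, q} \;\min_{\mu \ge 0} \;
\mathbb{E}_{z \sim p(z)}\Big[
  \mathbb{E}_{s \sim d_z}\big[\log q(z \mid s)\big]
  - \mu_z \Big( \mathrm{D}_\mathrm{KL}\big(d_z(S)\,\|\,d_E(S)\big) - \epsilon \Big)
\Big] .
```

In this reading, a large multiplier $\mu_z$ pushes skill $z$ toward the expert state occupancy, while a small one lets the discriminator term dominate, which is the diversity-imitation trade-off that $\epsilon$ controls.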

Paper Structure (43 sections, 10 theorems, 62 equations, 16 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Method
Approximation Scheme
Approximation Phases
Phase 1
Phase 2
Phase 3
Algorithm
Experiments
Locomotion
Data collection.
Importance ratios distance.
Successor features distance.
...and 28 more sections

Key Result

Lemma 4.2

Given ratios $\eta_{z}(s,a)$, using imp-weight-proc applied with $f(s,a,z)=\log q(z|s)$, we can compute offline an optimal skill-discriminator $q^{\star}(z|s)$. In particular, we optimize by gradient descent the following optimization problem $\max \ldots$
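
Since the lemma statement is cut off before the objective appears, the following is only one plausible reading of it, not the paper's formula: importance weighting replaces an expectation under each skill's occupancy with an average over the offline dataset weighted by $\eta_z(s,a)$, so the discriminator maximizes $\frac{1}{N}\sum_i \sum_z \eta_z(s_i,a_i)\log q(z|s_i)$. The linear softmax discriminator, the synthetic data, and all function names below are hypothetical choices for illustration.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the skill axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def weighted_log_likelihood(W, states, eta):
    # (1/N) * sum_i sum_z eta[i, z] * log q(z | s_i) for a linear softmax q.
    # W: (state_dim, n_skills), states: (N, state_dim), eta: (N, n_skills).
    log_q = log_softmax(states @ W)
    return float((eta * log_q).sum() / len(states))

def gradient(W, states, eta):
    # d/dlogits of sum_z eta_z * log q_z equals eta - (sum_z eta_z) * q.
    q = np.exp(log_softmax(states @ W))
    g_logits = eta - eta.sum(axis=1, keepdims=True) * q
    return states.T @ g_logits / len(states)

def fit_discriminator(states, eta, steps=500, lr=0.5):
    # Plain gradient ascent on the importance-weighted objective.
    W = np.zeros((states.shape[1], eta.shape[1]))
    for _ in range(steps):
        W += lr * gradient(W, states, eta)
    return W

# Synthetic stand-in for the offline dataset: skill 0 has nonzero ratio only
# on states with positive first coordinate, skill 1 only on the rest.
rng = np.random.default_rng(0)
states = rng.normal(size=(400, 2))
eta = 2.0 * np.stack([states[:, 0] > 0, states[:, 0] <= 0], axis=1).astype(float)

W = fit_discriminator(states, eta)
before = weighted_log_likelihood(np.zeros((2, 2)), states, eta)
after = weighted_log_likelihood(W, states, eta)
print(f"objective before: {before:.3f}, after: {after:.3f}")
```

Only states and ratios from the fixed dataset enter the update, which is what makes the procedure offline: no rollouts of the skill policies are needed to train $q(z|s)$.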

Figures (16)

Figure 1: Diverse Offline Imitation (DOI) maximizes a variational lower bound on the mutual information between latent skills $z$ and states $s$ visited by associated skill-conditioned policies $\pi_z$, subject to a KL-divergence constraint to limit the deviation of the state occupancy $d_z(s)$ of each latent skill $z$ from that of an expert $d_E(s)$.
Figure 2: Illustration of the DOI algorithm. We compute expert importance ratios $\eta_{\widetilde{E}}(s,a)$ by running SMODICE on the offline datasets $\mathcal{D}_{E}$ and $\mathcal{D}_{O}$. These expert ratios are then used in the alternating optimization scheme to obtain the importance ratios $\eta_{z}(s,a)$ (with support in $\mathcal{D}_{O}$) for each skill $z$. Specifically, the skill ratios $\eta_{z}(s,a)$ are computed by a DICE-like offline policy evaluation algorithm that takes as input a reward $R_z^\mu(s,a)$ balancing skill diversity (skill discriminator $q(z \vert s)$) and expert imitation (importance ratios $\eta_{\widetilde{E}}(s,a)$).
Figure 3: Separation of data points by importance ratios $\eta_z(s,a)$ for different levels of $\epsilon$ in Solo12. (a) Distribution of importance ratios $\eta_z(s,a)$ over the offline dataset $\mathcal{D}_{O}$ for distinct skills with DOI$^4$ ($\epsilon=4$) (upper) and a skill-conditioned variant of SMODICE (lower). (b) Average $\ell_1$ distance between the ratios $\eta_z$ of distinct skills as a function of $\epsilon$. The higher the value of $\epsilon$, the greater the $\ell_1$ distance. The shaded areas show the interval between the 0.25 and 0.75 quantiles, computed over 3 seeds.
Figure 4: (a) Average $\ell_2$ distance between Monte Carlo estimates of successor features $\psi_z$ of distinct skills; (b) return $r$ as $\%$ of expert return and standard deviation of base height $\mathrm{std}_z(h)$. Both are shown as functions of $\epsilon$ for the Solo12. The shaded areas show the interval between the 0.25 and 0.75 quantiles, computed over 3 seeds.
Figure 5: Successor Features projection onto 2D space using the UMAP algorithm.
...and 11 more figures
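
The two diversity measures named in the captions of Figures 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names, the absence of any normalization by dataset size, the discount factor, and the toy inputs are all hypothetical. Successor features are estimated in the usual Monte Carlo way, as the average over rollouts of $\sum_t \gamma^t \phi(s_t)$.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_distance(vectors, norm_ord):
    # Average distance between the rows of `vectors`, one row per skill,
    # taken over all pairs of distinct skills.
    pairs = combinations(range(len(vectors)), 2)
    return float(np.mean([
        np.linalg.norm(vectors[i] - vectors[j], ord=norm_ord) for i, j in pairs
    ]))

def successor_features_mc(feature_trajectories, gamma=0.99):
    # Monte Carlo estimate of successor features for one skill:
    # mean over rollouts of sum_t gamma^t * phi(s_t).
    estimates = []
    for phi in feature_trajectories:  # phi has shape (T, feature_dim)
        discounts = gamma ** np.arange(len(phi))
        estimates.append(discounts @ phi)
    return np.mean(estimates, axis=0)

# Toy importance ratios: 3 skills evaluated on 3 dataset points.
eta = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 0.0],
])
l1 = mean_pairwise_distance(eta, norm_ord=1)

# Toy rollouts: skill A always observes feature vector (1, 1), skill B (0, 0).
skill_a = [np.ones((2, 2)), np.ones((2, 2))]
skill_b = [np.zeros((2, 2))]
psi = np.stack([
    successor_features_mc(skill_a, gamma=0.5),
    successor_features_mc(skill_b, gamma=0.5),
])
l2 = mean_pairwise_distance(psi, norm_ord=2)
print(l1, l2)
```

The ratio-based distance needs only the offline dataset, whereas the successor-feature distance requires rollouts of each skill, so comparing the two is one way to check whether offline and online diversity agree.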

Theorems & Definitions (17)

Lemma 4.2
Lemma 4.3
Lemma D.1
proof
Lemma D.2: Discriminator Gradient
proof
Lemma D.3: State-Action KL Estimator
proof
Lemma D.4: Structural
proof
...and 7 more
