Offline Diversity Maximization Under Imitation Constraints
Marin Vlastelica, Jin Cheng, Georg Martius, Pavel Kolev
TL;DR
The paper tackles unsupervised skill discovery in the offline setting through a constrained mutual-information objective: maximize $\mathcal{I}(S;Z)$ so that skill-conditioned policies are diverse, subject to the KL-divergence constraint $\mathrm{D}_\mathrm{KL}(d_z(S)\|d_E(S))\le \epsilon$, which keeps each skill's state occupancy $d_z$ close to the occupancy $d_E$ of state-only expert demonstrations. It introduces Diverse Offline Imitation (DOI), a three-phase offline algorithm that uses Fenchel duality to connect dual value functions with primal state-action occupancies. DOI relies on offline importance ratios computed via SMODICE to train the skill policies, a skill discriminator $q(z|s)$, and the Lagrange multipliers that regulate the diversity-imitation trade-off. The method is evaluated on the D4RL offline benchmark and on a dataset from the 12-DoF Solo12 robot, with additional sim-to-real experiments in which policies trained in simulation transfer to the real robot. Two findings stand out: a larger $\epsilon$ yields more diverse skills, as reflected by $\eta_z(s,a)$ and successor-feature distances, at the cost of some task performance; and diversity measured offline aligns with online diversity metrics. The result is a principled, tractable framework for offline skill discovery with an explicit, tunable trade-off between diversity and imitation.
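To make the constrained objective concrete, the following is a minimal tabular sketch, not the DOI algorithm itself: it involves no Fenchel duality, no SMODICE importance ratios, and no learned discriminator. It assumes a small discrete state space in which the skill-conditioned occupancies $d_z(S)$ and the expert occupancy $d_E(S)$ are known exactly, and uses the identity $\mathcal{I}(S;Z)=\sum_z p(z)\,\mathrm{D}_\mathrm{KL}(d_z(S)\|d(S))$ with $d(s)=\sum_z p(z)\,d_z(s)$. The function names, the uniform skill prior, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same states."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mutual_information(occupancies, skill_prior):
    """I(S;Z) = sum_z p(z) * KL(d_z || d), where d(s) = sum_z p(z) * d_z(s)."""
    n_states = len(occupancies[0])
    mixture = [
        sum(pz * d[s] for pz, d in zip(skill_prior, occupancies))
        for s in range(n_states)
    ]
    return sum(pz * kl_divergence(d, mixture)
               for pz, d in zip(skill_prior, occupancies))


def lagrangian(occupancies, skill_prior, expert, multipliers, epsilon):
    """Generic relaxation: I(S;Z) + sum_z lambda_z * (epsilon - KL(d_z || d_E))."""
    slack = [epsilon - kl_divergence(d, expert) for d in occupancies]
    penalty = sum(lam * s for lam, s in zip(multipliers, slack))
    return mutual_information(occupancies, skill_prior) + penalty


# Toy setup (made-up numbers): 3 states, 2 skills, uniform skill prior.
expert = [0.5, 0.3, 0.2]                      # d_E(S)
skills = [[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]]   # d_z(S) for z = 0, 1
prior = [0.5, 0.5]                            # p(z)
epsilon = 0.05                                # imitation slack

diversity = mutual_information(skills, prior)
violations = [kl_divergence(d, expert) - epsilon for d in skills]
print("I(S;Z):", diversity)
print("constraint violations (<= 0 means feasible):", violations)
print("Lagrangian with unit multipliers:",
      lagrangian(skills, prior, expert, [1.0, 1.0], epsilon))
```

In this toy setting, moving the skill occupancies further apart raises $\mathcal{I}(S;Z)$ but also raises each $\mathrm{D}_\mathrm{KL}(d_z(S)\|d_E(S))$, so a larger $\epsilon$ admits more diverse skills, which mirrors the trade-off described above.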
Abstract
There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data, and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot, for which the policies trained in simulation transfer well to the real robotic system.
