Offline Learning of Controllable Diverse Behaviors

Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier, Ludovic Denoyer

TL;DR

The paper tackles offline imitation learning from diverse demonstrations, addressing the limitation that a single policy cannot reproduce the full trajectory distribution of humans. It introduces Stylized Imitation Learning with temporal consistency and a controllable latent style, instantiated as ZBC and SWR (with WZBC) to reconstruct trajectory-scale diversity offline. The approach formalizes trajectory-scale diversity, employs a simple trajectory-index style embedding, and uses dissimilarity-based weighting to balance fidelity and controllability. Experiments on Maze2D and D3IL show improved trajectory-scale diversity reconstruction, controllability, and robustness to stochasticity, offering a practical path for diverse, controllable behaviors in games and robotics.

Abstract

Imitation Learning (IL) techniques aim to replicate human behaviors in specific tasks. While IL has gained prominence due to its effectiveness and efficiency, traditional methods often focus on datasets collected from experts to produce a single efficient policy. Recently, extensions have been proposed to handle datasets of diverse behaviors, mainly by learning transition-level diverse policies or by performing entropy maximization at the trajectory level. While these methods may lead to diverse behaviors, they may not be sufficient to reproduce the actual diversity of demonstrations or to allow controlled trajectory generation. To overcome these drawbacks, we propose a different method based on two key features: a) Temporal Consistency, which ensures consistent behaviors across entire episodes and not just at the transition level, and b) Controllability, obtained by constructing a latent space of behaviors that allows users to selectively activate specific behaviors based on their requirements. We compare our approach to state-of-the-art methods over a diverse set of tasks and environments. Project page: https://mathieu-petitbois.github.io/projects/swr/

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: (Left) Trajectories in the maze: the start of a trajectory is shown in blue, the end in yellow, and the goal in green. (Right) Histograms of the behavior distribution of the datasets.
  • Figure 2: (Left) Pictures of the environments. (Right) Histograms of the behavior distribution of the datasets. In blue are the provided dataset’s behaviors and in yellow are those of our unbalanced dataset.
  • Figure 3: Dissimilarity values of trajectories $\nu({\color{red} \tau^*},\tau)$ for different reference trajectories ${\color{red} \tau^*}$, shown in red. Blue trajectories are the most similar, green the most dissimilar.
  • Figure 4: Values of the conditional input sample distributions $\rho(\tilde{s}|z)$. $\beta = 0$ gives an input sampling distribution similar to that of BC, while $\beta = 100$ gives one similar to that of ZBC. $\beta = 3.0$ is a middle ground: it keeps the full support of BC while weighting samples strongly enough to distinguish trajectories by their similarity.
  • Figure 5: (Left) L1 distance between the training behavior histogram and, respectively, the property-controlled agent evaluation behavior histogram, the free agent evaluation histogram (without filtering controls for desired lengths), and the controlled train histogram. (Right) In blue: the training length histogram; purple: the conditioned training length histogram; green: the free agent eval length histogram; yellow: the controlled eval length histogram.
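
The role of $\beta$ described in the Figure 4 caption can be illustrated with a small sketch. The captions do not give the weighting formula, so the exponential form below, $w(z, i) \propto \exp(-\beta\,\nu(\tau_z, \tau_i))$, is an assumption chosen only because it reproduces the two limits the caption mentions (uniform sampling as in BC at $\beta = 0$, near one-hot sampling as in ZBC at large $\beta$). The function name `sampling_weights` and the toy dissimilarity matrix are hypothetical and not taken from the paper.

```python
import numpy as np

def sampling_weights(dissimilarity, beta):
    """Per-style sampling distribution over trajectories.

    ASSUMED form (not from the paper): row z is proportional to
    exp(-beta * dissimilarity[z, i]), normalized to sum to 1.
    """
    logits = -beta * np.asarray(dissimilarity, dtype=float)
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# Toy dissimilarity matrix for 3 trajectories (illustrative values only):
# zero on the diagonal, larger values mean less similar trajectories.
nu = np.array([[0.0, 0.2, 1.0],
               [0.2, 0.0, 0.9],
               [1.0, 0.9, 0.0]])

for beta in (0.0, 3.0, 100.0):
    print(f"beta={beta}")
    print(np.round(sampling_weights(nu, beta), 3))
```

Under this assumed form, $\beta = 0$ makes every style sample all trajectories uniformly, a very large $\beta$ makes each style sample essentially only its own trajectory, and intermediate values keep every trajectory in the support while favoring similar ones.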