
SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

Zhixuan Liang, Yao Mu, Hengbo Ma, Masayoshi Tomizuka, Mingyu Ding, Ping Luo

TL;DR

SkillDiffuser presents an end-to-end hierarchical framework that learns discrete, interpretable skills from visual observations and language, then conditions a diffusion-based planner on these skills to generate coherent, multi-step state trajectories. An inverse dynamics network converts predicted states into actions, enabling execution from language instructions without proprioceptive input. The approach combines horizon-based skill discovery via vector quantization with classifier-free diffusion guidance, achieving state-of-the-art results on LOReL Sawyer and Meta-World MT10 while yielding interpretable visualizations of learned skills. The work demonstrates strong generalization to unseen compositions and provides evidence of skill reusability across tasks, advancing interpretable, adaptable diffusion-based planning for complex manipulation tasks.
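The classifier-free guidance named in the TL;DR can be illustrated with a minimal sketch. It assumes the common convention in which the guided noise estimate is the unconditional estimate plus a scaled difference between the conditional and unconditional estimates. The function name, the `guidance_scale` parameter, and the toy vectors are hypothetical and are not taken from the paper; in SkillDiffuser the condition would be the skill embedding.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend unconditional and conditional noise estimates.

    Assumed convention: eps = eps_uncond + w * (eps_cond - eps_uncond).
    The paper may use a different but equivalent parameterization.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise estimates must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise estimates standing in for the outputs of a denoising network.
eps_u = [0.0, 1.0]
eps_c = [1.0, 1.0]
guided = classifier_free_guidance(eps_u, eps_c, guidance_scale=2.0)
```

Under this convention a scale of 0 recovers the unconditional estimate, a scale of 1 recovers the conditional one, and larger values push the sample further toward the condition.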

Abstract

Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent trajectories from high-level instructions remains challenging, especially for long-range composition tasks requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. At the higher level, the skill abstraction module learns discrete, human-understandable skill representations from visual observations and language instructions. These learned skill embeddings are then used to condition the diffusion model to generate customized latent trajectories aligned with the skills. This allows generating diverse state trajectories that adhere to the learnable skills. By integrating skill learning with conditional trajectory generation, SkillDiffuser produces coherent behavior following abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks like Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser. More visualization results and information can be found on our website.
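The discrete skill representations described in the abstract are attributed in the TL;DR to vector quantization. A minimal sketch of that operation, assuming a nearest-neighbour lookup under squared Euclidean distance, is shown here. The codebook values, the function name, and the two-dimensional embeddings are illustrative assumptions; the paper's codebook is learned and its dimensions are not stated in this summary.

```python
from typing import List, Sequence, Tuple


def quantize_skill(
    z: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index and vector of the codebook entry nearest to z."""
    if not codebook:
        raise ValueError("codebook must contain at least one skill vector")

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two equally sized vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return best, list(codebook[best])


# Hypothetical 3-entry codebook; each row plays the role of one discrete skill.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, skill = quantize_skill([0.9, 0.2], codebook)
```

The returned index is what makes a skill inspectable: it can be tallied against instruction tokens, which is how a word-frequency heat map per skill could be assembled.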

Paper Structure (42 sections, 3 theorems, 29 equations, 14 figures, 10 tables, 2 algorithms)


Key Result

Theorem A.1

The conditional sampling probability of the reverse diffusion process $p_{\theta,\phi}(\boldsymbol{\tau}^i \mid \boldsymbol{\tau}^{i+1},\boldsymbol{y})$ is proportional to the unconditional transition probability $p_{\theta}(\boldsymbol{\tau}^i \mid \boldsymbol{\tau}^{i+1})$ multiplied by the classified pro

Figures (14)

  • Figure 1: Comparison of SkillDiffuser and previous language-conditioned diffusers. SkillDiffuser utilizes high-level abstraction to translate visual observations and language instructions into human-understandable skills with language grounding. It then conditions the low-level diffusion model on these skills, not only improving the execution performance of multi-step composition tasks but also greatly enhancing the generalization and adaptability of the framework.
  • Figure 2: Overall framework of SkillDiffuser. It is a hierarchical planning model that leverages the cooperation of interpretable skill abstractions at the higher level and a skill-conditioned diffusion model at the lower level for task execution in a multi-task learning environment. The high-level skill abstraction is achieved through a skill predictor and a vector quantization operation, generating sub-goals (skill set) that the diffusion model employs to determine the appropriate future states. Future states are converted to actions using an inverse dynamics model. This unique fusion enables a consistent underlying planner across different tasks, with the variation only in the inverse dynamics model.
  • Figure 3: SkillDiffuser's low-level skill-conditioned diffusion planning model. Notably, while the schematic here employs images to represent visual features for illustrative purposes, in the actual implementation both the input to and the sampling output of the diffusion model are state embeddings. The current observation is also the feature embedding of the current visual observation.
  • Figure 4: Visualization of skill heat map on LOReL Sawyer compositional tasks. We display the word frequency associated with a skill set of size 20 in LOReL, normalized by column. The data's sparsity and distinct highlights indicate certain language tokens are uniquely linked to specific skills. There are eleven skills learned by our method. (zoom in for best view)
  • Figure 5: Visualization of skill heat map on LOReL. We display the word frequency associated with a skill set of size 20 in LOReL, normalized by column. The data's sparsity and distinct highlights indicate certain language tokens are uniquely linked to specific skills. There are eleven skills learned by our method.
  • ...and 9 more figures
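
The control flow described in the Figure 2 caption (skill predictor, skill-conditioned planner, inverse dynamics model) can be sketched as a loop over injected callables. This is a toy illustration under several assumptions: states are scalars, every component is a hand-written stand-in rather than a learned network, and the planned states are executed open-loop within each horizon. None of the names below come from the paper.

```python
from typing import Callable, List, Tuple


def run_hierarchical_episode(
    initial_state: float,
    instruction: str,
    predict_skill: Callable[[float, str], float],
    plan_states: Callable[[float, float, int], List[float]],
    inverse_dynamics: Callable[[float, float], float],
    step_env: Callable[[float, float], float],
    horizon: int,
    num_skill_steps: int,
) -> Tuple[float, List[float]]:
    """Pick a skill once per horizon, plan future states, then act on them."""
    state = initial_state
    actions: List[float] = []
    for _ in range(num_skill_steps):
        # High level: one skill per horizon (stand-in for predictor + quantizer).
        skill = predict_skill(state, instruction)
        # Low level: future states conditioned on the skill (stand-in for diffusion).
        planned = plan_states(state, skill, horizon)
        for next_state in planned:
            # Convert each planned state transition into an action.
            action = inverse_dynamics(state, next_state)
            actions.append(action)
            state = step_env(state, action)
    return state, actions


# Toy stand-ins: the skill is a direction, the plan moves 0.1 per step along it.
def toy_skill(state: float, instruction: str) -> float:
    return 1.0 if "right" in instruction else -1.0


def toy_plan(state: float, skill: float, horizon: int) -> List[float]:
    return [state + skill * 0.1 * (k + 1) for k in range(horizon)]


final_state, actions = run_hierarchical_episode(
    0.0, "move right", toy_skill, toy_plan,
    inverse_dynamics=lambda s, s_next: s_next - s,
    step_env=lambda s, a: s + a,
    horizon=3, num_skill_steps=2,
)
```

Passing the components as callables mirrors the caption's point that the planner can stay fixed across tasks while only the inverse dynamics model varies.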

Theorems & Definitions (6)

  • Theorem A.1
  • proof
  • Theorem A.2
  • proof
  • Theorem A.3
  • proof