Iterative On-Policy Refinement of Hierarchical Diffusion Policies for Language-Conditioned Manipulation

Clemence Grislain; Olivier Sigaud; Mohamed Chetouani

TL;DR

Hierarchical policies for language-conditioned manipulation decompose tasks into subgoals, with a high-level planner guiding a low-level controller. This paper proposes HD-ExpIt, a framework that iteratively fine-tunes hierarchical diffusion policies using environment feedback.

Abstract

Hierarchical policies for language-conditioned manipulation decompose tasks into subgoals, where a high-level planner guides a low-level controller. However, these hierarchical agents often fail because the planner generates subgoals without considering the actual limitations of the controller. Existing solutions attempt to bridge this gap via intermediate modules or shared representations, but they remain limited by their reliance on fixed offline datasets. We propose HD-ExpIt, a framework for iterative fine-tuning of hierarchical diffusion policies via environment feedback. HD-ExpIt organizes training into a self-reinforcing cycle: it utilizes diffusion-based planning to autonomously discover successful behaviors, which are then distilled back into the hierarchical policy. This loop enables both components to improve while implicitly grounding the planner in the controller's actual capabilities without requiring explicit proxy models. Empirically, HD-ExpIt significantly improves hierarchical policies trained solely on offline data, achieving state-of-the-art performance on the long-horizon CALVIN benchmark among methods trained from scratch.

Paper Structure (36 sections, 5 equations, 14 figures, 9 tables, 4 algorithms)

Introduction
Related Work
Hierarchical policies for language-conditioned manipulation
Fine-tuning diffusion-based and hierarchical policies with environment feedback
Problem Statement
Method
Hierarchical Diffusion Policy
Training
Policy Update with Supervised training
Rollouts Collection
Datasets aggregation
Experimental Setup
Architectures
Environments
Baselines
...and 21 more sections

Figures (14)

Figure 1: HD-ExpIt compared with common hierarchical policy training paradigms. Left: Existing strategies for training hierarchical policies from a fixed, offline dataset $D_0$: (a) independent supervised training of HL and LL; (b) integration of an intermediate "glue" model to bridge planning and control; and (c) joint training via shared cross-level representations. Right (d): The proposed HD-ExpIt framework utilizes an iterative refinement cycle: (1) independent supervised updates of the policy components from the current dataset $D_t$; (2) on-policy rollout collection where the diffusion planner’s stochasticity serves as a generative search mechanism to discover successful trajectories, which are filtered based on environment feedback to capture LL's actual capabilities; and (3) dataset aggregation where these successful trajectories $\mathcal{R}_t$ are either added to or used as the training set for the next iteration. Far Right: Detail of the hierarchical interaction during data collection, where for contexts in $\mathcal{C}(D_t)$ HL generates $K$ subgoal sequences $\hat{\zeta}$ executed by LL via action chunks $a_c$.
Figure 2: HD-ExpIt improves HD policies to achieve SOTA results on CALVIN MTLC. Mean success rate across tasks of the HD policy trained on $D_0$ and after three iterations of each version of HD-ExpIt, compared to baselines ($*$ marks results from previous work; bars indicate standard error over 3 seeds for ours and re-evaluated methods).
Figure 3: All variants of HD-ExpIt significantly improve HD policies in both the Franka-3Blocks and CALVIN environments. Performance of HD policies is shown after 0 (trained solely on $D_0$) to 3 iterations of each HD-ExpIt variant. Scatter points represent the mean metric, while shaded areas denote the standard error across 3 seeds. (a) Mean success rate across tasks in Franka-3Blocks. (b) Mean successful sequence length in CALVIN LH-MTLC.
Figure 4: Environments
Figure 5: Training hyperparameters for both environments and the different component architectures.
...and 9 more figures
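
The three-step cycle described in the Figure 1 caption (supervised update, filtered on-policy rollouts, dataset aggregation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `refine`, the callables `train`, `rollout`, and `is_success`, the toy trajectory type, and the fallback when no rollout succeeds are all assumptions introduced here. It also assumes that "0 iterations" means a policy trained on $D_0$ only, as the Figure 3 caption suggests.

```python
from typing import Callable, Hashable, List, Sequence, Tuple, TypeVar

Policy = TypeVar("Policy")
# Toy stand-in for a trajectory: (context, executed data). Hypothetical type.
Trajectory = Tuple[Hashable, tuple]


def refine(
    d0: Sequence[Trajectory],
    train: Callable[[List[Trajectory]], Policy],
    rollout: Callable[[Policy, Hashable], Trajectory],
    is_success: Callable[[Trajectory], bool],
    iterations: int = 3,
    k: int = 4,
    aggregate: bool = True,
) -> Tuple[Policy, List[Trajectory]]:
    """Sketch of an iterative refinement loop under the assumptions above."""
    dataset = list(d0)
    # Iteration 0: supervised training on the offline dataset D_0.
    policy = train(dataset)
    for _ in range(iterations):
        # Contexts C(D_t) present in the current dataset, in first-seen order.
        contexts = list(dict.fromkeys(ctx for ctx, _ in dataset))
        successes: List[Trajectory] = []
        for ctx in contexts:
            # K stochastic attempts per context; `rollout` hides the
            # planner-plus-controller interaction with the environment.
            for _attempt in range(k):
                traj = rollout(policy, ctx)
                # Keep only trajectories the environment marks as successful.
                if is_success(traj):
                    successes.append(traj)
        if aggregate:
            # Variant 1: add the successful rollouts R_t to the dataset.
            dataset = dataset + successes
        elif successes:
            # Variant 2: use R_t alone as the next training set.
            # Assumption: keep the old dataset if nothing succeeded.
            dataset = successes
        # Supervised update on the new dataset D_{t+1}.
        policy = train(dataset)
    return policy, dataset
```

Passing `aggregate=False` switches from adding the successful rollouts to replacing the training set with them, which mirrors the "either added to or used as the training set" wording in the caption.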
