Preference Aligned Visuomotor Diffusion Policies for Deformable Object Manipulation

Marco Moletta; Michael C. Welle; Danica Kragic

TL;DR

Problem: aligning pretrained visuomotor diffusion policies to user-specific preferences in deformable object manipulation (DOM), particularly garment folding, with limited demonstrations. Approach: introduce Diffusion-RKO, a preference-alignment method that combines RPO-style context-aware weighting with KTO-style per-sample feedback, and compare it to Diffusion-DPO, Diffusion-RPO, Diffusion-KTO, and a vanilla DDPM baseline. Contributions: a systematic comparison of DPO, RPO, and KTO for diffusion policies in DOM, the introduction of RKO, and real-world cloth-folding experiments showing improved performance and sample efficiency. Significance: demonstrates practical, scalable personalization of robot cloth-folding behavior with limited data, enabling user-specific styles and preferences.

Abstract

Humans naturally develop preferences for how manipulation tasks should be performed, which are often subtle, personal, and difficult to articulate. Although it is important for robots to account for these preferences to increase personalization and user satisfaction, they remain largely underexplored in robotic manipulation, particularly in the context of deformable objects like garments and fabrics. In this work, we study how to adapt pretrained visuomotor diffusion policies to reflect preferred behaviors using limited demonstrations. We introduce RKO, a novel preference-alignment method that combines the benefits of two recent frameworks: RPO and KTO. We evaluate RKO against common preference learning frameworks, including these two, as well as a baseline vanilla diffusion policy, on real-world cloth-folding tasks spanning multiple garments and preference settings. We show that preference-aligned policies (particularly RKO) achieve superior performance and sample efficiency compared to standard diffusion policy fine-tuning. These results highlight the importance and feasibility of structured preference learning for scaling personalized robot behavior in complex deformable object manipulation tasks.

Paper Structure (17 sections, 10 equations, 5 figures, 3 tables)

Introduction
Related Work
Learning Deformable Object Manipulation Skills
Preference Alignment in Robotics
Method
Preliminaries: Visuomotor diffusion models
Preference Alignment frameworks
Diffusion-DPO
Diffusion-RPO
Diffusion-KTO
Diffusion-RKO
Similarity reweighting and convergence intuition
Dataset Generation
Experiments
Experimental settings and implementation details
...and 2 more sections

Figures (5)

Figure 1: Two different user preferences for folding the same garments demonstrate how variations in execution can reflect personal styles or practical needs. Capturing and aligning with such preferences is essential for enabling robots to perform personalized and user-aligned behaviors in deformable object manipulation tasks like cloth folding.
Figure 2: General preference alignment framework used in this work. A reference model is first trained on a large set of demonstrations ($D_{\text{ref}}$) for a given task. To align it to a user’s preferred strategy, a new set of winning demonstrations is collected and combined into $D_{\text{pref}}$ along with losing demonstrations, i.e., examples of alternative, dispreferred behaviors (which may also come from $D_{\text{ref}}$). The preference loss then aligns the new policy $\pi_{\theta}$ to the preferred behavior by explicitly contrasting it with the losing demonstrations. This enables more effective and sample-efficient alignment than training a diffusion policy solely on the winning demonstrations.
Figure 3: Illustration of three garment-folding preferences for each garment type. Each panel shows the pick (circle) and place (diamond) positions for the left (orange) and right (light blue) arms. The scores are visible on the bottom right: bimanual actions are executed synchronously, and their scores are normalized so that each individual action contributes equally. The total score for a complete, correct folding sequence is $1$.
Figure 4: Experimental setup.
Figure 5: Sample efficiency experiment: influence of the number of winning demonstrations ($20$ to $95$, in increments of $15$) on performance for Trousers - Pref 1 and Sleeves - Pref 1.
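
The Figure 2 caption describes a preference loss that contrasts winning and losing demonstrations against a reference model. The paper's own equations are not reproduced on this page, so the following is only a minimal sketch, assuming a DPO-style objective computed on per-sample denoising errors, as is common for diffusion models. The function names, the argument names, and the `beta` default are illustrative assumptions, not the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_style_preference_loss(err_win_policy, err_win_ref,
                              err_lose_policy, err_lose_ref, beta=1.0):
    """DPO-style loss on denoising errors (hypothetical helper).

    Each err_* argument is a squared noise-prediction error for one noised
    action sample: "win"/"lose" is the demonstration type, "policy"/"ref" is
    the model that produced the prediction. beta is an assumed scaling factor.
    """
    # Negative gap means the policy denoises this sample better than the reference.
    win_gap = err_win_policy - err_win_ref
    lose_gap = err_lose_policy - err_lose_ref
    # The margin grows when the policy improves on the winning sample more
    # than it improves on the losing sample.
    margin = -beta * (win_gap - lose_gap)
    # -log(sigmoid(margin)) written as softplus(-margin) for stability.
    return softplus(-margin)


# A policy identical to the reference gives log(2), the neutral value.
neutral = dpo_style_preference_loss(1.0, 1.0, 1.0, 1.0)
# Fitting the winning demonstration better lowers the loss.
aligned = dpo_style_preference_loss(0.5, 1.0, 1.0, 1.0)
# Fitting the losing demonstration better raises the loss.
misaligned = dpo_style_preference_loss(1.0, 1.0, 0.5, 1.0)
print(neutral, aligned, misaligned)
```

In this sketch the losing demonstrations act as an explicit negative signal, which is the property the caption credits for more sample-efficient alignment than training on winning demonstrations alone.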
