ClothPPO: A Proximal Policy Optimization Enhancing Framework for Robotic Cloth Manipulation with Observation-Aligned Action Spaces

Libing Yang, Yang Li, Long Chen

TL;DR

ClothPPO tackles vision-based cloth unfolding under partial observability with a very large action space by introducing an observation-aligned pixel-space policy (OBAP) that uses rotated and scaled spatial action maps to produce ~$10^6$ actions. It combines self-supervised pre-training of a UNet-based policy with PPO-based finetuning using a clipped surrogate objective $L^{PPO}$ to optimize long-horizon rewards, and employs a reward design that ties cloth coverage to learning signals via $\tilde{r}_t$ normalization. The approach demonstrates strong performance and generalization across unseen garment types in Cloth Action Gym, achieving state-of-the-art results and offering a scalable framework for policy-based control in deformable-object manipulation. This work highlights a practical path to leveraging high-dimensional, pixel-space actions in robotics by coupling self-supervised initialization with PPO refinement and observation-aligned action sampling, enabling robust, data-efficient cloth unfolding.

Abstract

Vision-based robotic cloth unfolding has made great progress recently. However, prior works predominantly rely on value learning and have not fully explored policy-based techniques. Recently, the success of reinforcement learning on large language models has shown that policy gradient algorithms can enhance policies with huge action spaces. In this paper, we introduce ClothPPO, a framework that employs a policy gradient algorithm based on an actor-critic architecture to enhance a pre-trained model with a huge, observation-aligned action space of about $10^6$ actions in the task of unfolding clothes. To this end, we redefine the cloth manipulation problem as a partially observable Markov decision process. A supervised pre-training stage is employed to train a baseline model of our policy. In the second stage, Proximal Policy Optimization (PPO) is utilized to guide the supervised model within the observation-aligned action space. By optimizing and updating the strategy, our proposed method increases the garment's surface area for cloth unfolding under the soft-body manipulation task. Experimental results show that our proposed framework can further improve the unfolding performance of other state-of-the-art methods.
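
The PPO stage named in the abstract can be illustrated with the standard clipped surrogate objective from the original PPO formulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `ppo_clip_objective`, the clipping range `clip_eps=0.2`, and the use of plain Python lists are all hypothetical choices, and the exact loss used in ClothPPO (for example, any value or entropy terms) is not given on this page.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximised).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current policy and the policy that collected the data.
    advantages: advantage estimates for the same actions.
    clip_eps: clipping range (0.2 is a common default, assumed here).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic choice: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, the clip caps how much the objective can gain from raising an action's probability. With a negative advantage, the unclipped term is kept whenever it is the smaller one, so moves that make a bad action more likely are still fully penalised.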

Paper Structure (33 sections, 8 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Overview of ClothPPO. The first phase [canberk2022clothfunnels] is self-supervised pre-training, which collects labels from repeated actions in the environment. The model estimates the canonicalized alignment grasping label and outputs the action with the maximum estimated value. In the PPO training phase, we introduce a long-term reward mechanism to improve the model's performance in goal-oriented tasks.
  • Figure 2: SPM: Our action space and action sampling using spatial policy maps. The smaller maps at the bottom are different slices of the spatial policy maps, each representing a layer at a different scale and rotation. The mask applied to each layer filters out invalid actions, that is, those that would result in the robot's end-effector interacting with empty space or areas beyond the cloth. The masks vary in size because they correspond to different scales, which affects the size and granularity of the actions that can be sampled.
  • Figure 3: Comparing ClothPPO to PPO From Scratch. The dotted and solid lines represent the original and 0.2-smoothed data respectively. The red line corresponds to ClothPPO, while the blue line represents PPO From Scratch. ClothPPO performance (red line) demonstrates a superior mean best coverage across the number of steps, indicating enhanced task performance.
  • Figure 4: Reward Ablation. We compare three rewards. Threshold Achievement Reward (pink line): ends an episode when coverage exceeds 0.95; its design enhances computational efficiency and motivates the model to complete tasks quickly. Over Achievement Reward (blue line): continues the episode even after coverage exceeds 0.95. Immediate Termination Reward (green line): provides a reward and ends the episode as soon as coverage surpasses 0.95. Shaded areas show the variation across multiple training experiments.
  • Figure 5: Critic loss comparison. Using the reward scale as reward normalization contributes positively to the learning performance, enhancing stability and efficiency, as evidenced by the critic's lower and more stable loss values.
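
The masked sampling described in the Figure 2 caption can be sketched as a softmax restricted to the valid pixels of a stack of policy maps. This is an illustrative sketch under assumptions, not the paper's code: the function name `sample_masked_action`, the nested-list layout `[layer][row][col]` (one layer per rotation and scale pair), and the injectable `rng` parameter are hypothetical, and the real policy maps would come from the UNet-based policy rather than hand-written lists.

```python
import math
import random


def sample_masked_action(logits, mask, rng=random):
    """Sample one (layer, row, col) action from masked policy maps.

    logits: nested lists [layer][row][col] of unnormalised scores.
    mask:   same shape; truthy entries mark valid actions (e.g. on the cloth).
    Returns the sampled index triple and its log-probability.
    """
    # Enumerate only the actions that the mask allows.
    valid = [(k, i, j)
             for k, layer in enumerate(mask)
             for i, row in enumerate(layer)
             for j, ok in enumerate(row) if ok]
    if not valid:
        raise ValueError("mask excludes every action")
    scores = [logits[k][i][j] for k, i, j in valid]
    # Numerically stable softmax over the valid actions only.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    idx = rng.choices(range(len(valid)), weights=weights, k=1)[0]
    return valid[idx], math.log(weights[idx] / norm)


# Usage: two 2x2 layers; only one pixel in the second layer is valid.
example_logits = [[[0.1, 0.2], [0.3, 0.4]], [[1.0, 2.0], [3.0, 4.0]]]
example_mask = [[[0, 0], [0, 0]], [[0, 0], [1, 0]]]
action, log_prob = sample_masked_action(example_logits, example_mask)
```

Returning the log-probability alongside the action is a design choice that suits policy-gradient training, since PPO needs the log-probability of each sampled action under the data-collecting policy.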