SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation

Xin Li; Siyuan Huang; Qiaojun Yu; Zhengkai Jiang; Ce Hao; Yimeng Zhu; Hongsheng Li; Peng Gao; Cewu Lu

TL;DR

Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation.

Abstract

Automating garment manipulation poses a significant challenge for assistive robotics due to the diverse and deformable nature of garments. Traditional approaches typically require separate models for each garment type, which limits scalability and adaptability. In contrast, this paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. By interpreting both visual and semantic information, our approach enables robots to handle different garment states with a single model. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation. This research also underscores the potential of VLMs to unify various garment manipulation tasks within a single framework, paving the way for broader applications in home automation and assistive robotics.

Paper Structure (25 sections, 5 figures, 2 tables)

Introduction
Related Work
Robotic Garment Manipulation
Synthetic Data for Robotic Garment Manipulation
Dense Representations for Garment Manipulation
Method
Synthetic Dataset Generation
Garment Mesh Generation
Garment Mesh Deformation
Garment Image Generation
Keypoint Generation
Paired Keypoint Representation
Action Tuple Trajectory Generation
Vision Language Model Fine-Tuning
Model Architecture
...and 10 more sections

Figures (5)

Figure 1: Comparison of keypoint detection methods. The previous method [lips2024learning] struggles with deformed or ambiguous garment states, leading to inconsistent and incomplete keypoint predictions. In contrast, our SKT, which uses state-aware paired keypoints and vision-language models (VLMs), achieves more robust and accurate keypoint detection, improving generalization across flat, folded, and deformed garment configurations.
Figure 2: (a) The overall framework of the state-aware keypoint trajectory (SKT) method. SKT generates action trajectories for clothes manipulation by leveraging a fine-tuned vision-language model for state-aware paired keypoint prediction, with actions generated through (b) the action decoder.
Figure 3: A sample set of synthetic images depicting different fold states, with the corresponding paired action keypoint annotations.
Figure 4: Qualitative visual comparison. (a) The previous approach [lips2024learning] struggles to handle deformed or ambiguous garment states, often resulting in incomplete and inconsistent keypoint predictions. (b) In contrast, our method provides more robust and accurate keypoint detection across diverse garment configurations.
Figure 5: Evaluation of SKT's performance on manually collected unseen data, including long pants, various folding states with deformations, and long sleeves. The results demonstrate robust handling of long pants and folded garments, while challenges remain in generalizing to unseen long sleeves and complex folded sleeve configurations.
