Learning-based Cooperative Robotic Paper Wrapping: A Unified Control Policy with Residual Force Control
Rewida Ali, Cristian C. Beltran-Hernandez, Weiwei Wan, Kensuke Harada
TL;DR
This work tackles long-horizon manipulation of deformable materials in human-robot collaboration, using gift wrapping as the example task. It introduces a unified framework that combines an LLM-based high-level task planner, a Sub-task Aware Robotic Transformer (START), and a residual RL admittance controller to bridge high-level intent with fine-grained force control. The key contributions are LLM-driven sub-task planning, a unified START policy conditioned on sub-task IDs, and a constrained residual policy for compliant execution, which together achieve a 97% success rate on real-world wrapping tasks. The framework is validated on a UR3e platform with multi-view perception, where it performs robustly across task variations, and ablations highlight the importance of each component. The approach shows promise for automated and collaborative handling of deformable objects in industrial settings.
Abstract
Human-robot cooperation is essential in environments such as warehouses and retail stores, where workers frequently handle deformable objects like paper, bags, and fabrics. Coordinating robotic actions with human assistance remains difficult due to the unpredictable dynamics of deformable materials and the need for adaptive force control. To explore this challenge, we focus on the task of gift wrapping, which exemplifies a long-horizon manipulation problem involving precise folding, controlled creasing, and secure fixation of paper. A trial is successful when the robot completes the full sequence and produces a neatly wrapped package with clean folds and no tears. We propose a learning-based framework that integrates a high-level task planner powered by a large language model (LLM) with a low-level hybrid imitation learning (IL) and reinforcement learning (RL) policy. At its core is a Sub-task Aware Robotic Transformer (START) that learns a unified policy from human demonstrations. The key novelty lies in capturing long-range temporal dependencies across the full wrapping sequence within a single model. Unlike vanilla Action Chunking with Transformers (ACT), which is typically applied to short tasks, our method introduces sub-task IDs that provide explicit temporal grounding. This enables robust performance across the entire wrapping process and supports flexible execution, as the policy learns sub-goals rather than merely replicating motion sequences. Our framework achieves a 97% success rate on real-world wrapping tasks. We show that the unified transformer-based policy reduces the need for specialized models, allows controlled human supervision, and effectively bridges high-level intent with the fine-grained force control required for deformable object manipulation.
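
To make the control structure named above more concrete, the following Python sketch illustrates three ideas in isolation: conditioning an observation on a sub-task ID, adding a bounded residual to a nominal command, and a one-dimensional admittance update. It is not the authors' implementation. The one-hot encoding, the function names, the choice to apply the residual to a position reference, and all gains and bounds are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdmittanceParams:
    """Virtual mass-spring-damper gains (hypothetical values chosen by the caller)."""
    mass: float
    damping: float
    stiffness: float


def condition_on_subtask(obs, subtask_id, num_subtasks):
    """Append a one-hot sub-task ID to the observation vector.

    Assumption: one-hot concatenation stands in for whatever conditioning
    mechanism the actual policy uses.
    """
    if not 0 <= subtask_id < num_subtasks:
        raise ValueError("subtask_id out of range")
    one_hot = [1.0 if i == subtask_id else 0.0 for i in range(num_subtasks)]
    return list(obs) + one_hot


def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_residual(nominal, residual, max_residual):
    """Add a per-axis residual correction, clipped to +/- max_residual.

    Assumption: the constraint on the residual policy is modeled here as a
    simple symmetric bound on each axis.
    """
    return [n + clip(r, -max_residual, max_residual)
            for n, r in zip(nominal, residual)]


def admittance_step(x, v, x_ref, f_ext, params, dt):
    """One semi-implicit Euler step of M*a + D*v + K*(x - x_ref) = f_ext (1-D)."""
    a = (f_ext - params.damping * v - params.stiffness * (x - x_ref)) / params.mass
    v_new = v + a * dt
    x_new = x + v_new * dt
    return x_new, v_new


def simulate(x_ref, f_ext, params, dt, steps):
    """Run the admittance model from rest and return the final position."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        x, v = admittance_step(x, v, x_ref, f_ext, params, dt)
    return x
```

In this sketch, a constant external force f_ext displaces the end-effector from the reference by roughly f_ext / stiffness at steady state, which is the compliant behavior an admittance controller is meant to provide. Bounding the residual keeps the learned correction from overriding the nominal command produced by the demonstration-trained policy.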
