Learning-based Cooperative Robotic Paper Wrapping: A Unified Control Policy with Residual Force Control

Rewida Ali, Cristian C. Beltran-Hernandez, Weiwei Wan, Kensuke Harada

TL;DR

This work tackles long-horizon manipulation of deformable materials, exemplified by gift-wrapping, in human-robot collaboration. It introduces a unified framework that blends an LLM-based high-level task planner, a Sub-task Aware Robotic Transformer (START), and a residual RL admittance controller to bridge high-level intent with fine-grained force control. Key contributions include the LLM-driven sub-task planning, a unified START policy conditioned on sub-task IDs, and a constrained residual policy for compliant execution, achieving 97% real-world success. The framework is validated on a UR3e platform with multi-view perception, demonstrating robust performance across variations and providing ablations that highlight the importance of each component. This approach promises practical impact for automatic and collaborative handling of deformable objects in industrial settings.

Abstract

Human-robot cooperation is essential in environments such as warehouses and retail stores, where workers frequently handle deformable objects like paper, bags, and fabrics. Coordinating robotic actions with human assistance remains difficult due to the unpredictable dynamics of deformable materials and the need for adaptive force control. To explore this challenge, we focus on the task of gift wrapping, which exemplifies a long-horizon manipulation problem involving precise folding, controlled creasing, and secure fixation of paper. Success is achieved when the robot completes the sequence to produce a neatly wrapped package with clean folds and no tears. We propose a learning-based framework that integrates a high-level task planner powered by a large language model (LLM) with a low-level hybrid imitation learning (IL) and reinforcement learning (RL) policy. At its core is a Sub-task Aware Robotic Transformer (START) that learns a unified policy from human demonstrations. The key novelty lies in capturing long-range temporal dependencies across the full wrapping sequence within a single model. Unlike vanilla Action Chunking with Transformer (ACT), typically applied to short tasks, our method introduces sub-task IDs that provide explicit temporal grounding. This enables robust performance across the entire wrapping process and supports flexible execution, as the policy learns sub-goals rather than merely replicating motion sequences. Our framework achieves a 97% success rate on real-world wrapping tasks. We show that the unified transformer-based policy reduces the need for specialized models, allows controlled human supervision, and effectively bridges high-level intent with the fine-grained force control required for deformable object manipulation.
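
As an illustration of how a sub-task ID can give a single policy explicit temporal grounding, the sketch below appends a one-hot sub-task code to the observation features before they reach the policy. This is an assumption made for illustration only: the abstract does not say how START injects the ID (a learned embedding is equally plausible), and the sub-task names, `one_hot`, and `build_policy_input` are hypothetical.

```python
from typing import List, Sequence

# Hypothetical sub-task names; the abstract does not enumerate the real ones.
SUB_TASKS = ["fold_side", "crease_edge", "fix_with_tape"]


def one_hot(sub_task_id: int, num_sub_tasks: int) -> List[float]:
    """Encode a sub-task ID as a one-hot vector."""
    if not 0 <= sub_task_id < num_sub_tasks:
        raise ValueError(f"sub_task_id {sub_task_id} is out of range")
    return [1.0 if i == sub_task_id else 0.0 for i in range(num_sub_tasks)]


def build_policy_input(
    obs_features: Sequence[float], sub_task_id: int, num_sub_tasks: int
) -> List[float]:
    """Concatenate observation features with the sub-task code.

    Assumption: simple concatenation stands in for whatever conditioning
    mechanism the actual START model uses.
    """
    return list(obs_features) + one_hot(sub_task_id, num_sub_tasks)


# The same observation yields a different policy input for each sub-task,
# so one set of weights can produce sub-task-specific behavior.
obs = [0.5, -0.2]
crease_input = build_policy_input(obs, SUB_TASKS.index("crease_edge"), len(SUB_TASKS))
```

Because the ID is part of the input rather than a choice between separate models, a planner or a human supervisor could switch sub-tasks by changing one integer.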

Paper Structure

This paper contains 23 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: System Overview: (a) Application Context: Wrapping as the final stage in automated packaging. (b) Key Challenges: Wrinkles and tears.
  • Figure 2: Overview of the proposed framework: the system takes as input a natural language task description, multi-view RGB observations, robot proprioception, and force/torque data. The START model predicts high-level actions using a unified policy. In parallel, a language-based task planner synchronizes collaboration with the human partner, while a residual RL module learns admittance control parameters for the compliant execution of precise actions.
  • Figure 3: Task planner framework based on LLMs: it consists of two steps. First, GPT divides the task description into steps. Then, Codex generates executable robot commands by combining the steps with predefined primitives and the coordinates generated from the transformer-based learning model. In parallel, it also generates sub-task IDs.
  • Figure 4: Proposed framework for learning an IL task on a real robot using a modified task-aware START model.
  • Figure 5: The experimental hardware setup featuring a UR3e robotic arm with a Robotiq Hand-E gripper. Two Intel RealSense D435 cameras provide multi-view RGB perception.
  • ...and 2 more figures
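
The residual admittance idea named in the Figure 2 caption can be sketched for a single axis as follows. This is a minimal illustration under stated assumptions, not the authors' controller: the mass-spring-damper form, the semi-implicit Euler integration, the 20% residual bound, and the names `apply_residual` and `admittance_step` are all hypothetical.

```python
def clip(value: float, low: float, high: float) -> float:
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_residual(base: float, residual: float, max_fraction: float = 0.2) -> float:
    """Add a learned residual to a nominal gain, bounded to +/- max_fraction of it.

    Assumption: the constraint is a symmetric fraction of the nominal value.
    """
    bound = max_fraction * base
    return base + clip(residual, -bound, bound)


def admittance_step(x, v, force_error, mass, damping, stiffness, dt):
    """One semi-implicit Euler step of  M*a + D*v + K*x = force_error.

    x and v are the compliant position offset and its velocity;
    force_error is the measured force minus the desired force.
    """
    a = (force_error - damping * v - stiffness * x) / mass
    v_next = v + a * dt
    x_next = x + v_next * dt
    return x_next, v_next


# Nominal damping of 100 with a large proposed residual: the bound caps it at +20.
damping = apply_residual(base=100.0, residual=50.0)

# Roll the offset forward for a constant 1 N force error.
x, v = 0.0, 0.0
for _ in range(5):
    x, v = admittance_step(x, v, 1.0, mass=1.0, damping=damping, stiffness=200.0, dt=0.002)
```

Bounding the residual keeps the learned correction close to a hand-tuned nominal controller, which is one common way to let RL adapt compliance without allowing arbitrary gains.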