Robustifying Long-term Human-Robot Collaboration through a Multimodal and Hierarchical Framework

Peiqi Yu, Abulikemu Abuduweili, Ruixuan Liu, Changliu Liu

TL;DR

This work tackles robust, long-horizon human-robot collaboration by modeling tasks with a hierarchical graph and delivering a multimodal, hierarchical framework that fuses vision and speech signals. It introduces hierarchical pose and plan prediction, online adaptation, and a real-time robot controller to enable proactive, user-specific assistance across extended assembly tasks. Key contributions include a formal mutual-information justification for multimodal fusion, a DTW-based plan alignment mechanism, and extensive real-world validation showing improved task success, reduced disturbances, and higher user satisfaction. The approach holds significant practical impact for flexible manufacturing and assistive robotics in everyday environments by enhancing robustness, efficiency, and user experience in long-term HRC.
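To make the DTW-based plan alignment idea concrete, here is a minimal sketch of classic dynamic time warping over discrete action labels. It is not the authors' implementation: the action names, the candidate plan templates, the 0/1 mismatch cost, and the function names `dtw_distance` and `best_matching_plan` are all assumptions made for illustration.

```python
from typing import Callable, Dict, Sequence


def dtw_distance(
    observed: Sequence[str],
    template: Sequence[str],
    cost: Callable[[str, str], float] = lambda x, y: 0.0 if x == y else 1.0,
) -> float:
    """Classic DTW distance between two label sequences.

    The default 0/1 mismatch cost is an assumption for this sketch.
    """
    n, m = len(observed), len(template)
    inf = float("inf")
    # d[i][j] = cost of the best alignment of observed[:i] with template[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(observed[i - 1], template[j - 1])
            # Extend the cheapest of: a repeated template step, a repeated
            # observation, or a diagonal one-to-one match.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def best_matching_plan(observed: Sequence[str], plans: Dict[str, Sequence[str]]) -> str:
    """Return the name of the template with the smallest DTW distance."""
    return min(plans, key=lambda name: dtw_distance(observed, plans[name]))


# Hypothetical plan templates and observation, not taken from the paper.
plans = {
    "attach_wheel": ["reach", "grasp", "place"],
    "hand_over": ["reach", "grasp", "give"],
}
observed = ["reach", "reach", "grasp", "place"]
print(best_matching_plan(observed, plans))
```

Because DTW allows one template step to absorb several observed frames, a user who lingers on the "reach" motion still aligns to the same plan, which is the kind of tolerance to timing differences that plan alignment across users needs.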

Abstract

Long-term Human-Robot Collaboration (HRC) is crucial for enabling flexible manufacturing systems and integrating companion robots into daily human environments over extended periods. This paper identifies several key challenges for such collaborations, including accurate recognition of human plans, robustness to disturbances, operational efficiency, adaptability to diverse user behaviors, and sustained human satisfaction. To address these challenges, we model the long-term HRC task through a hierarchical task graph and present a novel multimodal and hierarchical framework that enables robots to better assist humans in advancing on the task graph. In particular, the proposed multimodal framework integrates visual observations with speech commands to facilitate intuitive and flexible human-robot interactions. Additionally, our hierarchical designs for both human pose detection and plan prediction allow better understanding of human behaviors and significantly enhance system accuracy, robustness, and flexibility. Moreover, an online adaptation mechanism enables real-time adjustment to diverse user behaviors. We deploy the proposed framework on a KINOVA GEN3 robot and conduct extensive user studies on real-world long-term HRC assembly scenarios. Experimental results show that our approaches reduce task completion time by 15.9% and achieve an average task success rate of 91.8% and an overall user satisfaction score of 84% in long-term HRC tasks, showcasing their applicability in enhancing real-world long-term HRC.

Paper Structure

This paper contains 28 sections, 11 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Overview of the hierarchical task graph for the long-term toy car assembly HRC task. Each node in the long-term graph (left) corresponds to a short-term subtask (right), where the human and robot collaborate through sequential actions. The task consists of four stages, and each stage is completed through a series of task nodes (e.g., both Node 1 and Node 2 must be completed to finish the Bottom Stage). The task starts at Node 1 and finishes at Node 7, with the directed arrows in the long-term task graph representing the temporal sequence of task execution (e.g., Node 1 must be completed before Node 2 is performed). Both high-level task planning and short-term execution involve uncertainties. In the subtask graph, dashed lines indicate collaborative actions, while solid lines represent actions performed by a single agent. Arrows denote task progression. Arrows with a cross indicate robot motions that were initially planned but not executed, because the robot adapted to human actions and switched to an alternative motion. For clarity, certain actions are grouped or omitted without affecting the task’s overall structure.
  • Figure 2: The architecture of the proposed HRC framework.
  • Figure 3: Final Assembly Task
  • Figure 4: Environment Setting
  • Figure 5: Misdetection in naive human pose detection model.
  • ...and 7 more figures
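
The long-term task graph described in the Figure 1 caption can be sketched as a small precedence structure. This is only an illustration: the caption confirms that the task runs from Node 1 to Node 7, that Node 1 precedes Node 2, and that the Bottom Stage consists of Nodes 1 and 2. The simple chain ordering of Nodes 3 to 7 and the names `PREDECESSORS`, `STAGES`, `ready_nodes`, and `stage_complete` are assumptions.

```python
from typing import Dict, List, Set

# Node -> nodes that must be finished first.
# Only the edge 1 -> 2 is stated in the caption; the remaining chain is assumed.
PREDECESSORS: Dict[int, List[int]] = {
    1: [],
    2: [1],
    3: [2],
    4: [3],
    5: [4],
    6: [5],
    7: [6],
}

# Stage -> member nodes. Only the Bottom Stage membership is given in the
# caption; the other three stages are left out rather than guessed.
STAGES: Dict[str, Set[int]] = {"bottom": {1, 2}}


def ready_nodes(completed: Set[int]) -> List[int]:
    """Nodes not yet done whose predecessors are all completed."""
    return sorted(
        node
        for node, preds in PREDECESSORS.items()
        if node not in completed and all(p in completed for p in preds)
    )


def stage_complete(stage: str, completed: Set[int]) -> bool:
    """A stage is complete once every one of its nodes is completed."""
    return STAGES[stage] <= completed


print(ready_nodes(set()))
print(stage_complete("bottom", {1, 2}))
```

A planner built on such a structure would query `ready_nodes` after each finished subtask to decide which short-term subtask graph the robot should prepare for next.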