QUB-PHEO: A Visual-Based Dyadic Multi-View Dataset for Intention Inference in Collaborative Assembly

Samuel Adebayo; Seán McLoone; Joost C. Dessing

TL;DR

QUB-PHEO addresses the need for rich, multi-view data to infer human intentions in collaborative assembly by introducing a five-camera dyadic dataset in which a human acts as a robot surrogate. The dataset comprises 70 participants (50 with full video data) and 36 subtasks, with dense visual annotations including facial landmarks, gaze, hand movements, and object bounding boxes, enabling fine-grained intention inference. The authors describe an end-to-end pipeline (calibration with ChArUco boards, gaze mapping inspired by GazeScape, Label Studio-based annotation, and a YOLOv8-based object detector) that yields multi-view data totalling 4.5 million frames and 36 hours of video. They also provide a formal framework for subtask classification and next-subtask inference, along with pathways for broader computer vision (CV) and HRI applications. The dataset is offered under an End-User License Agreement (EULA) to foster community contributions and real-world impact.

Abstract

QUB-PHEO introduces a visual-based, dyadic dataset with the potential to advance human-robot interaction (HRI) research in assembly operations and intention inference. The dataset captures rich multimodal interactions between two participants, one acting as a 'robot surrogate,' across a variety of assembly tasks that are further broken down into 36 distinct subtasks. It provides rich visual annotations for 70 participants, including facial landmarks, gaze, hand movements, and object localization, and is offered in two versions: full video data for 50 participants and visual cues for all 70. Designed to improve machine learning models for HRI, QUB-PHEO enables deeper analysis of subtle interaction cues and intentions. The dataset will be available at https://github.com/exponentialR/QUB-PHEO subject to an End-User License Agreement (EULA).

Paper Structure (30 sections, 2 equations, 14 figures, 4 tables)

Introduction
Background and Motivation
Contribution
Related Work
General HRI Datasets
Human-Robot Interaction Simulators
Multimodal HRI Datasets
Methodology
Participant Recruitment and Ethics Approval
Experimental Setup
Task Definition
Subtask Identification
Data Collection and Preprocessing
Calibration and Synchronization
Gaze Estimation and Mapping
...and 15 more sections

Figures (14)

Figure 1: Variety and perspectives in QUB-PHEO: showcasing diverse participant engagement in assembly tasks, captured from multiple camera views to facilitate comprehensive analysis of human-robot interaction.
Figure 2: High-level workflow of the QUB-PHEO dataset development: This flow outlines the systematic approach employed from initial data collection to final dataset generation. Starting with raw video inputs, the workflow incorporates systematic video annotation and rigorous quality checks during preprocessing. The feature extraction stage is then coupled with parallel annotation processes, culminating in the output dataset that integrates corrected video data with feature annotations.
Figure 3: Schematic view of the experimental setup for the QUB-PHEO dataset: This setup depicts a multi-view data collection strategy, employing an aerial view camera ($CAM\_AV$) and four additional cameras at strategic points: lower left ($CAM\_LL$), lower right ($CAM\_LR$), upper left ($CAM\_UL$), and upper right ($CAM\_UR$) to capture the interactions between a human participant ($PXX$) and the robot surrogate ($RS$).
Figure 4: Pictorial view of our experimental setup for data collection highlighting camera locations: $CAM\_UL$ (upper left), $CAM\_UR$ (upper right), $CAM\_LL$ (lower left), $CAM\_LR$ (lower right) and $CAM\_AV$ (aerial view), the robot surrogate's position ($RS$), and the participant's position ($PXX$).
Figure 5: Visual representations of the tasks in the dataset designed to capture a broad range of human interactions: (a) Bridge, (b) Stairway Shuffle, (c) Simple Tower, and (d) Block in a Hole. Each task has variations that challenge different interaction skills.
...and 9 more figures
