SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration

Dan Bohus, Sean Andrist, Ann Paradiso, Nick Saw, Tim Schoonbeek, Maia Stiber

TL;DR

SigmaCollab addresses the need for ecologically valid research on physically situated human–AI collaboration by providing an interactive, multimodal dataset captured as participants use a mixed-reality assistant to complete diverse procedures. The dataset is collected with the open-source Sigma platform on HoloLens 2, incorporating audio, egocentric video, depth, gaze, and pose data, plus post-hoc annotations. It enables benchmarks that test real-time coordination, grounding, and cognitive-state understanding in end-to-end tasks, moving beyond static, non-interactive datasets. The work demonstrates the feasibility of application-driven data collection and highlights opportunities to study proactive interventions, self-talk, and end-to-end system performance in realistic settings.

Abstract

We introduce SigmaCollab, a dataset enabling research on physically situated human-AI collaboration. The dataset consists of 85 sessions in which untrained participants were guided by a mixed-reality assistive AI agent in performing procedural tasks in the physical world. SigmaCollab includes rich, multimodal data streams, such as the participant and system audio, egocentric camera views from the head-mounted device, depth maps, and head, hand, and gaze tracking information, as well as additional annotations performed post-hoc. While the dataset is relatively small (~14 hours), its application-driven and interactive nature brings to the fore novel research challenges for human-AI collaboration and provides more realistic testing grounds for various AI models operating in this space. In future work, we plan to use the dataset to construct a set of benchmarks for physically situated collaboration in mixed-reality task-assistive scenarios. SigmaCollab is available at https://github.com/microsoft/SigmaCollab.

Paper Structure

This paper contains 25 sections, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Left: a user performing a procedural task with a mixed-reality headset running Sigma. Middle: first-person view showing the Sigma guidance panel and task-specific holograms. Right: visualization of the system's scene understanding, showing the egocentric camera view, depth map, detected objects, gaze, and hand and head pose in 3D space. © 2024 IEEE
  • Figure 2: Illustration of the various data streams available in the SigmaCollab dataset. The time axis runs along the top of the image, and the blue vertical line marks the current time position for the image streams. (A) Color camera image; (B) 3D color and depth camera poses, head pose, gaze direction, hands, and mixed-reality UI positioning; (C) left grayscale camera image; (D) right grayscale camera image; (E) depth image; (F) temporal step execution boundaries; (G) temporal sub-step execution boundaries; (H) audio; (I) system utterances; (J) participant utterances; (K) temporal intervals in which the participant is gazing at the mixed-reality user interface panel. The red dots in (A), (C), (D), and (E) show the corresponding projected gaze points.
  • Figure 3: Data collection study participant demographics.
  • Figure 4: Dataset statistics.
  • Figure :
  • ...and 7 more figures