NatSGLD: A Dataset with Speech, Gesture, Logic, and Demonstration for Robot Learning in Natural Human-Robot Interaction

Snehesh Shrestha, Yantian Zha, Saketh Banagiri, Ge Gao, Yiannis Aloimonos, Cornelia Fermüller

TL;DR

NatSGLD addresses the gap in multimodal HRI datasets by providing a real-time Unity-based simulator and a Wizard-of-Oz collected dataset that jointly records speech, gestures, LTL-grounded task representations, and expert demonstrations for kitchen-domain tasks. The framework maps natural multimodal commands to symbolic LTL formulas while capturing detailed robot trajectories, enabling research in multimodal instruction following, plan recognition, and human-advisable reinforcement learning from demonstrations. Its contributions include the first integration of speech, gestures, LTL annotations, and demonstrations in a unified resource, along with a playable simulator, data pipelines, and analysis tools. This dataset and platform support more natural and capable human-robot collaboration, offering tangible benefits for training and evaluating complex HRI systems across perception, planning, and control.

Abstract

Recent advances in multimodal Human-Robot Interaction (HRI) datasets emphasize the integration of speech and gestures, allowing robots to absorb explicit knowledge and tacit understanding. However, existing datasets primarily focus on elementary tasks like object pointing and pushing, limiting their applicability to complex domains. They prioritize simpler human command data but place less emphasis on training robots to correctly interpret tasks and respond appropriately. To address these gaps, we present the NatSGLD dataset, which was collected using a Wizard of Oz (WoZ) method, where participants interacted with a robot they believed to be autonomous. NatSGLD records humans' multimodal commands (speech and gestures), each paired with a demonstration trajectory and a Linear Temporal Logic (LTL) formula that provides a ground-truth interpretation of the commanded tasks. This dataset serves as a foundational resource for research at the intersection of HRI and machine learning. By providing multimodal inputs and detailed annotations, NatSGLD enables exploration in areas such as multimodal instruction following, plan recognition, and human-advisable reinforcement learning from demonstrations. We release the dataset and code under the MIT License at https://www.snehesh.com/natsgld/ to support future HRI research.

Paper Structure

This paper contains 38 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: NatSGLD contains speech, gestures, temporal logic annotations, and demonstration trajectories for everyday food preparation, cooking, and cleaning tasks. The dataset enables the learning of complex human-robot interaction tasks through its rich interaction modalities and strong supervision signals at both the trajectory level (demonstrations) and the symbolic level (ground-truth temporal logic formulas that match human expectations of the robot).
  • Figure 2: Illustration of Our Simulator (Baxter Opens Pot Lid): Our simulator provides a comprehensive, multi-view perspective of the robot and its environment, captured through both static and mobile cameras. The top row includes three camera angles: the human's first-person view, an overhead shot (top left), and two views from the kitchen counter (top right and left). The bottom row presents the robot's egocentric view, featuring multiple sensor outputs: RGB, depth images, distinct object segmentation map, and category-based semantic segmentation map. These varied perspectives enable the robot to interpret and perform tasks via human speech and gesture commands, as well as autonomously evaluate its surroundings and the state of objects.
  • Figure 3: NatSGLD experiment setup (top view). The participant points at the pot on the right stove and asks the robot to boil water, implying the right burner should be turned on. The bottom-left corner shows the location of the hidden Wizard of Oz, who observes the participant and controls the robot.
  • Figure 4: In this example, participants instruct the robot to cut onions, and the ways they naturally choose to communicate the same task are diverse.
  • Figure 5: The learning framework that translates paired speech and gesture data into an LTL formula, addressing multimodal human task understanding.
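
To make the pairing of a multimodal command, an LTL formula, and a demonstration more concrete, the following is a minimal Python sketch built around the Figure 3 scenario (pointing at the pot on the right stove and asking the robot to boil water). Everything specific here is an assumption made for illustration: the `Sample` record and its field names, the proposition names `right_burner_on`, `left_burner_on`, and `water_boiling`, the tuple encoding of formulas, and the use of finite-trace semantics are hypothetical and do not reflect the dataset's actual schema or the authors' tooling.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Hypothetical formula encoding (not the dataset's format), as nested tuples:
#   ("ap", name)  atomic proposition     ("not", f)     negation
#   ("and", f, g) conjunction            ("F", f)       eventually
#   ("G", f)      always                 ("U", f, g)    f until g
Formula = Tuple
Trace = List[FrozenSet[str]]  # one set of true propositions per time step


def holds(f: Formula, trace: Trace, i: int = 0) -> bool:
    """Evaluate formula f at step i of a non-empty finite trace."""
    op = f[0]
    if op == "ap":
        return f[1] in trace[i]
    if op == "not":
        return not holds(f[1], trace, i)
    if op == "and":
        return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == "F":
        return any(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "G":
        return all(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "U":
        # g must hold at some step j >= i, with f holding at every step before j.
        for j in range(i, len(trace)):
            if holds(f[2], trace, j):
                return True
            if not holds(f[1], trace, j):
                return False
        return False
    raise ValueError(f"unknown operator: {op!r}")


@dataclass
class Sample:
    """Hypothetical record pairing a command with its LTL formula and demo."""
    speech: str
    gesture: str
    ltl: Formula
    demo: Trace


# Assumed reading of the command: the water eventually boils, and it does
# not boil before the right burner has been turned on.
boil_water = (
    "and",
    ("U", ("not", ("ap", "water_boiling")), ("ap", "right_burner_on")),
    ("F", ("ap", "water_boiling")),
)

sample = Sample(
    speech="boil the water",
    gesture="points at the pot on the right stove",
    ltl=boil_water,
    demo=[
        frozenset(),
        frozenset({"right_burner_on"}),
        frozenset({"right_burner_on", "water_boiling"}),
    ],
)

# A made-up trajectory in which the wrong burner is used.
wrong_demo: Trace = [
    frozenset(),
    frozenset({"left_burner_on"}),
    frozenset({"left_burner_on"}),
]

demo_ok = holds(sample.ltl, sample.demo)
wrong_ok = holds(sample.ltl, wrong_demo)
```

Under these assumptions, `demo_ok` is `True` and `wrong_ok` is `False`. The sketch shows how a symbolic formula can check whether a trajectory matches the intended task even when the spoken command alone ("boil the water") does not say which burner to use.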