NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot Learning in Natural Human-Robot Interaction

Snehesh Shrestha, Yantian Zha, Saketh Banagiri, Ge Gao, Yiannis Aloimonos, Cornelia Fermuller

TL;DR

NatSGD serves as a foundational resource at the intersection of machine learning and HRI research, and we demonstrate its effectiveness in training robots to understand tasks through multimodal human commands, emphasizing the significance of jointly considering speech and gestures.

Abstract

Recent advancements in multimodal Human-Robot Interaction (HRI) datasets have highlighted the fusion of speech and gesture, expanding robots' capabilities to absorb explicit and implicit HRI insights. However, existing speech-gesture HRI datasets often focus on elementary tasks, like object pointing and pushing, revealing limitations in scaling to intricate domains and prioritizing human command data over robot behavior records. To bridge these gaps, we introduce NatSGD, a multimodal HRI dataset encompassing natural human commands given through speech and gestures, synchronized with robot behavior demonstrations. NatSGD serves as a foundational resource at the intersection of machine learning and HRI research, and we demonstrate its effectiveness in training robots to understand tasks through multimodal human commands, emphasizing the significance of jointly considering speech and gestures. We have released our dataset, simulator, and code to facilitate future research in human-robot interaction system learning; access these resources at https://www.snehesh.com/natsgd/

Paper Structure (43 sections, 2 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: NatSGD contains speech, gestures, and demonstration trajectories for everyday food preparation, cooking, and cleaning tasks. The NatSGD dataset potentially enables the learning of complex human-robot interaction tasks due to the rich interaction modalities and strong supervision signals at both the trajectory level (demonstrations) and the symbolic level (ground-truth activities that match humans' intention).
  • Figure 2: In this example, participants command the robot to cut onions, and their natural choices of communication are diverse.
  • Figure 3: Illustration of Our Simulator: Baxter Slicing Onions. This simulator offers a range of multi-view perspectives, capturing both the environment and the robot through static and mobile camera angles. The top row presents the human-first-person view, along with overhead (top left), and kitchen counter (bottom right and bottom left) camera angles. The bottom row showcases the robot's egocentric viewpoint, encompassing RGB, depth, distinctive object segmentation, and category-based semantic segmentation. These diverse perspectives empower the robot to learn and execute tasks based on human speech-gesture commands, as well as its autonomous assessment of the surroundings and object conditions.
  • Figure 4: NatSGD experiment setup, top view. In this figure, the participant points at the pot on the right stove and asks the robot to boil some water, implying that the right burner needs to be turned on. The bottom left shows the location of the Wizard of Oz, who is hidden from the participant while making observations and controlling the robot.
  • Figure 5: The learning framework for translating a pair of speech and gesture data to an LTL formula that can solve multi-modal human task understanding problems.
  • ...and 6 more figures