ReStory: VLM-augmentation of Social Human-Robot Interaction Datasets

Fanjun Bu, Wendy Ju

TL;DR

The paper tackles the scarcity and noise of in-the-wild HRI data by combining ethnomethodology-informed storyboards with Vision-Language Models to produce human-interpretable, semantically aligned storyboards that augment existing datasets. ReStory uses EMCA templates, two-caption prompts, and SBERT-based semantic matching to swap frames across footage and generate plausible new interaction episodes. A validation study with seven researchers demonstrates that synthesized storyboards largely preserve core interaction patterns and support frame-by-frame narration, though interpretive variability and causality issues remain. This approach offers a practical, semi-supervised data augmentation and design tool for HRI researchers and interaction designers, enabling richer scenario exploration without additional field data collection.
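The semantic-matching step mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that matching means picking the frame whose caption is most similar, by cosine similarity, to a caption from the storyboard template. The names `toy_embed`, `cosine`, and `best_match` are hypothetical, and `toy_embed` is a bag-of-words stand-in for an SBERT sentence encoder so the example runs without downloading a model.

```python
import math
from collections import Counter
from typing import Callable, Sequence, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a sentence encoder (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(
    template_caption: str,
    frame_captions: Sequence[str],
    embed: Callable[[str], Counter] = toy_embed,
) -> Tuple[int, float]:
    """Return (index, score) of the frame caption closest to the template caption."""
    target = embed(template_caption)
    scores = [cosine(target, embed(caption)) for caption in frame_captions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical captions, invented for illustration only.
captions = [
    "a woman walks past the robot",
    "a woman looks at the robot",
    "an empty hallway",
]
index, score = best_match("a man looks at the robot", captions)
print(index, round(score, 3))
```

In practice the `embed` argument would be replaced by a real sentence encoder returning dense vectors, with `cosine` adapted accordingly. The toy encoder is only there to keep the sketch self-contained.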

Abstract

Internet-scale datasets are a luxury for human-robot interaction (HRI) researchers, as collecting natural interaction data in the wild is time-consuming and logistically challenging. The problem is exacerbated by robots' different form factors and interaction modalities. Inspired by recent work on ethnomethodology and conversation analysis (EMCA) in the domain of HRI, we propose ReStory, a method that has the potential to augment existing in-the-wild human-robot interaction datasets by leveraging Vision-Language Models (VLMs). While still requiring human supervision, ReStory is capable of synthesizing human-interpretable interaction scenarios in the form of storyboards. We hope our proposed approach provides HRI researchers and interaction designers with a new angle on utilizing their valuable and scarce data.

Paper Structure (17 sections, 3 equations, 1 figure)


Figures (1)

  • Figure 1: ReStory pipeline. The new storyboard features the woman interacting with the robot the same way the man does. The reference storyboard is from the original paper published by Brown et al. The input video is sampled at two frames per second.