ReStory: VLM-augmentation of Social Human-Robot Interaction Datasets
Fanjun Bu, Wendy Ju
TL;DR
The paper tackles the scarcity and noise of in-the-wild human-robot interaction (HRI) data by combining storyboarding informed by ethnomethodology and conversation analysis (EMCA) with Vision-Language Models (VLMs), producing human-interpretable, semantically aligned storyboards that augment existing datasets. ReStory uses EMCA templates, two-caption prompts, and SBERT-based semantic matching to swap frames across footage and generate plausible new interaction episodes. A validation study with seven researchers indicates that the synthesized storyboards largely preserve core interaction patterns and support frame-by-frame narration, though interpretive variability and causality issues remain. The approach offers a practical, human-supervised data augmentation and design tool for HRI researchers and interaction designers, enabling richer scenario exploration without additional field data collection.
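As a rough illustration of the caption-level semantic matching mentioned above, the sketch below picks, from a pool of candidate frame captions, the one closest to a query caption by cosine similarity. This is a sketch under stated assumptions, not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for an SBERT sentence encoder, and the function names and example captions are hypothetical.

```python
import math
from collections import Counter


def embed(caption: str) -> Counter:
    """Toy bag-of-words vector; a stand-in (assumption) for an SBERT embedding."""
    return Counter(caption.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query_caption: str, candidate_captions: list[str]) -> tuple[int, float]:
    """Return (index, score) of the candidate caption most similar to the query."""
    query_vec = embed(query_caption)
    scores = [cosine(query_vec, embed(c)) for c in candidate_captions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    # Hypothetical captions, e.g. as a VLM might produce for frames of other footage.
    pool = [
        "a person walks past the robot",
        "a person stops and looks at the robot",
        "the robot moves toward a person",
    ]
    idx, score = best_match("a pedestrian stops and looks at the robot", pool)
    print(idx, round(score, 3))
```

With a real sentence encoder in place of `embed`, paraphrases that share few words (for example "pedestrian" and "person") would also score as similar, which is the point of using sentence embeddings rather than word overlap.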
Abstract
Internet-scale datasets are a luxury for human-robot interaction (HRI) researchers, as collecting natural interaction data in the wild is time-consuming and logistically challenging. The problem is exacerbated by robots' differing form factors and interaction modalities. Inspired by recent work on ethnomethodology and conversation analysis (EMCA) in the domain of HRI, we propose ReStory, a method that has the potential to augment existing in-the-wild human-robot interaction datasets by leveraging Vision-Language Models. While it still requires human supervision, ReStory can synthesize human-interpretable interaction scenarios in the form of storyboards. We hope our approach gives HRI researchers and interaction designers a new angle on using their valuable and scarce data.
