The Human Robot Social Interaction (HSRI) Dataset: Benchmarking Foundational Models' Social Reasoning
Dong Won Lee, Yubin Kim, Denison Guvenoz, Sooyeon Jeong, Parker Malachowsky, Louis-Philippe Morency, Cynthia Breazeal, Hae Won Park
TL;DR
The HSRI Dataset addresses the need for real-world benchmarks of social reasoning in embodied AI by assembling 400 human-robot interaction videos with over 10K annotations covering robot social errors, competencies, and seven social attributes. It introduces eight benchmark tasks that probe error and competency detection, social-attribute reasoning, interaction-flow understanding, and the generation of rationales and corrective actions, and it evaluates 17 state-of-the-art models, including language-only and multimodal systems. Across tasks, AI models show substantial gaps relative to human performance, and no single model excels on all dimensions, which underscores the complexity of real-world social interactions and the value of the dataset as a challenging testbed. The work also situates HSRI among related datasets and annotation frameworks, positioning it as a resource for developing socially intelligent embodied agents and automatic evaluators that can help improve the policies of AI agents in social settings.
Abstract
Our work aims to advance the social reasoning of embodied artificial intelligence (AI) agents in real-world social interactions. Recently, language models (LMs) and foundational models (FMs) have been used as automatic evaluators of human-AI interactions, with the goal of eventually using them to improve the policy of the AI agent. To enable further research in this direction, we introduce a large-scale real-world Human Robot Social Interaction (HSRI) Dataset to benchmark the capabilities of LMs and FMs to identify and reason about social interactions, specifically with regard to robot social errors and competencies. Our dataset consists of 400 real-world human social robot interaction videos and over 10K annotations detailing the robot's social errors, competencies, rationale, and corrective actions, capturing unique aspects of human-AI interaction that are present only in real-world interactions. To further assess AI models' ability to reason about social interactions, we propose eight new benchmark tasks centered on whether AI models can (1) evaluate social interactions by detecting social errors and competencies, (2) identify the explanatory factors associated with errors and competencies, (3) understand the flow of real-world social interactions, and (4) provide reasons and corrective actions for social errors. Human studies and experiments with modern LMs and FMs reveal that current models struggle with these tasks, demonstrating that our dataset and benchmark provide a step forward towards socially intelligent AI.
