The Human Robot Social Interaction (HSRI) Dataset: Benchmarking Foundational Models' Social Reasoning

Dong Won Lee, Yubin Kim, Denison Guvenoz, Sooyeon Jeong, Parker Malachowsky, Louis-Philippe Morency, Cynthia Breazeal, Hae Won Park

TL;DR

The HSRI Dataset addresses the need for real-world benchmarks of social reasoning in embodied AI by assembling 400 human-robot interaction videos with over 10K annotations of robot social errors and competencies across seven social attributes. It introduces eight benchmark tasks that probe error/competence detection, social-attribute reasoning, interaction-flow understanding, and rationale/correction, and evaluates 17 state-of-the-art models, including language-only and multimodal systems. Across tasks, AI models fall substantially short of human performance, and no single model excels on all dimensions, which underscores both the complexity of real-world social interactions and the value of the dataset as a challenging testbed. The work also situates HSRI among related datasets and annotation frameworks, positioning it as a resource for developing socially intelligent embodied agents and evaluators that can help improve the policies of AI agents in social settings.

Abstract

Our work aims to advance the social reasoning of embodied artificial intelligence (AI) agents in real-world social interactions. Recently, language models (LMs) and foundational models (FMs) have been used as automatic evaluators of human-AI interactions, with the goal of eventually using them to improve the policy of the AI agent. To enable further research in this direction, we introduce a large-scale real-world Human Robot Social Interaction (HSRI) Dataset to benchmark the capabilities of LMs and FMs to identify and reason about social interactions, specifically with regard to robot social errors and competencies. Our dataset consists of 400 real-world human social robot interaction videos and over 10K annotations, detailing the robot's social errors, competencies, rationale, and corrective actions, capturing unique aspects of human-AI interaction only present in real-world interactions. To further assess AI models' ability to reason about social interactions, we propose eight new benchmark tasks centered around whether AI models can (1) evaluate social interactions by detecting social errors and competencies, (2) identify the explanatory factors associated with errors and competencies, (3) understand the flow of real-world social interactions, and (4) provide reasons and corrective actions for social errors. Human studies and experiments with modern LMs and FMs reveal that current models struggle with these tasks, demonstrating that our dataset and benchmark provide a step forward towards socially intelligent AI.
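As a rough illustration of the kind of record the abstract describes (a label, a rationale, and a corrective action per annotated moment), the following minimal sketch shows one way such annotations could be represented and a detection task scored. All field names, the `LABELS` tuple, and the use of plain accuracy are assumptions made for illustration; they are not the dataset's actual schema or the paper's metric.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical label set for a three-way detection task (assumption).
LABELS = ("error", "competency", "none")


@dataclass
class Annotation:
    """One annotated moment in an interaction video (illustrative fields only)."""
    video_id: str
    label: str                        # one of LABELS
    channel: Optional[str] = None     # e.g. "verbal" or "non-verbal"
    attribute: Optional[str] = None   # e.g. "conversational mechanics"
    rationale: str = ""               # why the moment is an error or competency
    correction: str = ""              # suggested corrective action, if any

    def __post_init__(self) -> None:
        # Reject labels outside the assumed label set.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def detection_accuracy(gold: Sequence[Annotation], predicted: Sequence[str]) -> float:
    """Fraction of annotations whose gold label matches the predicted label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(a.label == p for a, p in zip(gold, predicted))
    return correct / len(gold)


# Toy data, invented for this example.
gold = [
    Annotation("v001", "error", "verbal", "conversational mechanics",
               "robot interrupted the user", "wait for the user to finish"),
    Annotation("v001", "competency", "verbal", "engagement"),
    Annotation("v002", "none"),
    Annotation("v002", "error", "non-verbal", "intention"),
]
predicted = ["error", "competency", "none", "competency"]
accuracy = detection_accuracy(gold, predicted)
```

In this toy run three of the four predictions match the gold labels. A real evaluation would follow whatever splits and metrics the benchmark itself defines.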

Paper Structure

This paper contains 30 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Our dataset offers real-world Human Robot Social Interaction videos and annotations of errors and competencies, the channel and type of social attribute, along with rationale and possible corrective actions.
  • Figure 2: Overall characteristics of the HSRI Dataset. Our dataset contains a high proportion of overlapping annotations, with a high level of agreement among annotators on error and competency labels. The dataset includes more annotations from the verbal channel than from the non-verbal one, with a balanced proportion of error and competency labels. Among the social attributes, the majority of annotations fall under conversational mechanics, followed by intention and engagement.
  • Figure 3: Our benchmark offers eight tasks dedicated to probing various facets of AI models' social reasoning with regard to detecting social errors and competencies, identifying social attributes, understanding the progression of social interactions, and rationalizing and correcting social errors.
  • Figure 4: Results per model across all 8 tasks; human performance is marked with dotted lines. (L): language-only inputs, (L+V): language and visual inputs. Gemini-1.5-flash performs best on the Error/Comp./None detection and error detection tasks, gpt-4o performs best on attribute identification, InternVL2 on multiple attribute presence, gpt-4o and its variants do well on the interaction progression (pre, post) reasoning tasks, o1 performs well on the rationale task, and gpt-4o with visual input and CoT performs best on the correction task. As models are updated or scaled, their social reasoning capabilities tend to improve across the board. Models from the same family tend to show similar shapes on the radar plot, reflecting consistent reasoning patterns across capabilities, which supports the benchmark's sensitivity in measuring underlying social reasoning abilities.
  • Figure 5: Wellness dataset [jeong2023deploying] statistics: We find that 69.8% of the dataset consists of overlapping annotations. Among the overlapping samples, as shown in Figure B, we find a 78.1% overall agreement, where annotators agree on the error/competency and social/competency labels. In Figure C, we show the percentage of errors and competencies on the left, and whether they relate to social or performance aspects. We find that the majority are competencies relating to social dimensions. In the bottom row, we show plots of whether the errors or competencies manifested in perception, reasoning, or behavior. In figure (d), we find that the majority of the annotations marked by annotators belong to the verbal communication category. In figure (e) and (d), we find that most annotations concern understanding or responding to (1) recognizing engagement, (2) conversational mechanics, and (3) intent. If we consider the competencies and errors separately, we find that annotators marked the largest number of errors for conversational mechanics, intent, and knowledge state, and the largest number of competencies for engagement and social context.
  • ...and 6 more figures