JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups

Simindokht Jahangard, Zhixi Cai, Shiki Wen, Hamid Rezatofighi

TL;DR

JRDB-Social introduces a robot-centered, three-level annotation dataset to study human social behavior in varied contexts. It combines individual attributes, intra-group dynamics, and social-group context, with text descriptions and a standardized annotation toolbox. The paper benchmarks vision-language models on these tasks, revealing that current models excel at predicting demographic attributes but struggle with higher-level social reasoning and group context. The dataset and evaluation framework provide a resource for developing socially aware robotic perception and interaction systems.

Abstract

Understanding human social behaviour is crucial in computer vision and robotics. Micro-level observations like individual actions fall short, necessitating a comprehensive approach that considers individual behaviour, intra-group dynamics, and social group levels for a thorough understanding. To address dataset limitations, this paper introduces JRDB-Social, an extension of JRDB. Designed to fill gaps in human understanding across diverse indoor and outdoor social contexts, JRDB-Social provides annotations at three levels: individual attributes, intra-group interactions, and social group context. This dataset aims to enhance our grasp of human social dynamics for robotic applications. Utilizing the recent cutting-edge multi-modal large language models, we evaluated our benchmark to explore their capacity to decipher social human behaviour.

Paper Structure (17 sections, 6 figures, 23 tables)

This paper contains 17 sections, 6 figures, 23 tables.

Figures (6)

  • Figure 1: Highlighted instances from the JRDB-Social dataset with detailed annotations at three levels. Individual level: attributes such as age, gender, and race are shown as color-coded abbreviations; for example, 'MMC' stands for Male, Middle Adulthood, Caucasian. Intra-group level: interactions between each pair of group members at the frame level are shown as dashed lines. Group level: each social group [ehsanpour2020joint] is drawn in a single colour and accompanied by a textual description covering the number of members, their specific attributes, the connection of their body positions with the content, the presence of salient scene content near the group, the venue, and the group's aim or purpose.
  • Figure 2: Interaction classes sorted in descending order of frame count across all data, shown on a log scale. Difficulty levels are indicated as E (Easy), M (Medium), and H (Hard).
  • Figure 3: Statistics of individual attributes.
  • Figure 4: Social group level word cloud in the dataset. Left: location of body posture and objects. Top: group aim. Right: venue locations. Larger words indicate higher frequency.
  • Figure 5: Group-level F1 score of MiniGPT-4 across different cropping scales.
  • ...and 1 more figure