SocialGesture: Delving into Multi-person Gesture Understanding

Xu Cao, Pranav Virupaksha, Wenqi Jia, Bolin Lai, Fiona Ryan, Sangmin Lee, James M. Rehg

TL;DR

SocialGesture introduces the first large-scale, multi-person gesture dataset and a complementary VQA benchmark to advance understanding of non-verbal social cues in natural interactions. It provides rich annotations (gesture categories, temporal-spatial localization, interaction dynamics, and VQA pairs) and defines three benchmark tasks: temporal localization, gesture recognition, and gesture-focused VQA. Across experiments, current video and vision-language models underperform on multi-person social gestures, especially for localization and cross-modal reasoning, underscoring the need for stronger visual reasoning capabilities in complex social scenes. The work establishes a foundation for robust social gesture understanding and motivates future research on models that can jointly reason about multiple people, objects, and language in real-world settings.

Abstract

Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gesture during complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark the performance of vision-language models (VLMs) on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at huggingface.co/datasets/IrohXu/SocialGesture.

Paper Structure

This paper contains 27 sections, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Example frames from six gesture datasets. SocialGesture is the only dataset featuring multi-person interactions and focusing on natural gestures with meaningful social communication.
  • Figure 2: Examples of the four deictic gesture categories in SocialGesture with subject-target relationships. From left to right: pointing (directing attention), showing (presenting objects), giving (transfer intention), and reaching (acquisition intention) gestures. Red boxes indicate gesture initiators (subjects) and blue notations indicate targets.
  • Figure 3: Temporal gesture localization task for social gestures.
  • Figure 5: The question-answer pairs of SocialGesture. We omit the options of each question in the figure. The bounding box is defined by [top-left x, top-left y, bottom-right x, bottom-right y]. The definition will be provided together with the system prompts.
  • Figure : Pointing Gesture
  • ...and 4 more figures