SocialGen: Modeling Multi-Human Social Interaction with Language Models
Heng Yu, Juze Zhang, Changan Chen, Tiange Xiang, Yusu Fang, Juan Carlos Niebles, Ehsan Adeli
TL;DR
SocialGen models social interactions among an arbitrary number of people by unifying motion and language through a novel XH3D representation and a VQVAE-based motion tokenizer. Motion tokens share a vocabulary with a language model, which enables generation and understanding across captioning, completion, and reaction tasks. The authors also introduce SocialX, a GPT-4o-annotated dataset and benchmark for evaluating multi-human motion-language tasks, and report state-of-the-art results on several metrics. The framework supports scalable multi-human interaction modeling and holds promise for open-domain social robotics, VR, and human-centric AI applications.
Abstract
Human interactions in everyday life are inherently social, involving engagements with diverse individuals across various contexts. Modeling these social interactions is fundamental to a wide range of real-world applications. To address this crucial yet challenging problem, we introduce SocialGen, the first unified motion-language model capable of modeling interaction behaviors among varying numbers of individuals. Unlike prior methods that are limited to two-person interactions, we propose a novel social motion representation that supports tokenizing the motions of an arbitrary number of individuals and aligning them with the language space. This alignment enables the model to leverage rich, pretrained linguistic knowledge to better understand and reason about human social behaviors. To tackle data scarcity, we curate a comprehensive multi-human interaction dataset, SocialX, enriched with textual annotations. Leveraging this dataset, we establish the first comprehensive benchmark for multi-human interaction tasks. Our method achieves state-of-the-art performance across motion-language tasks, setting a new standard for multi-human interaction modeling.
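As a rough illustration of what "tokenizing motion and aligning it with the language space" can mean in a VQVAE-style setup, the sketch below maps continuous motion latents to their nearest codebook entries and shifts the resulting indices past a text vocabulary so that both token types share one ID space. This is not the authors' implementation: the codebook values, the latent dimensionality, `TEXT_VOCAB_SIZE`, and both function names are hypothetical, and the paper's actual representation and multi-person handling are not reproduced here.

```python
from typing import List, Sequence

# Hypothetical size of the language model's text vocabulary (assumption).
TEXT_VOCAB_SIZE = 32000


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def motion_to_token_ids(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    offset: int = TEXT_VOCAB_SIZE,
) -> List[int]:
    """Quantize each latent frame, then shift indices past the text vocabulary."""
    return [offset + nearest_code(z, codebook) for z in latents]


# Toy 2-D codebook and encoder outputs, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.2], [0.1, 0.8]]

ids = motion_to_token_ids(latents, codebook)
print(ids)  # [32000, 32001, 32002]
```

Offsetting the codebook indices is one common way to let a single embedding table serve both text and motion tokens; whether SocialGen uses exactly this scheme is not stated in the abstract.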
