Towards Social AI: A Survey on Understanding Social Interactions
Sangmin Lee, Minzhi Li, Bolin Lai, Wenqi Jia, Fiona Ryan, Xu Cao, Ozgur Kara, Bikram Boote, Weiyan Shi, Diyi Yang, James M. Rehg
TL;DR
This survey frames social AI around three capabilities: understanding multimodal social cues, understanding multi-party dynamics, and understanding beliefs. It organizes the literature on verbal, non-verbal, and multimodal social understanding; reviews core tasks (dialogue act and emotion analysis; commonsense reasoning; gesture, gaze, and facial expression understanding; and multimodal emotion and conversation analysis); and catalogs key datasets. The authors identify gaps in integration, long-range contextual reasoning, and the handling of cultural and individual differences, as well as the need for richer benchmarks, and they propose concrete directions for alignment, adaptive fusion, memory-augmented reasoning, and belief-aware evaluation. The work aims to guide future research toward robust, context-aware, and culturally competent social intelligence in AI systems, with implications for embodied and virtual agents, chatbots, and interactive systems.
Abstract
Social interactions form the foundation of human societies. Artificial intelligence has made significant progress in certain areas, but enabling machines to seamlessly understand social interactions remains an open challenge. Addressing this gap requires endowing machines with social capabilities. We identify three key capabilities needed for effective social understanding: 1) understanding multimodal social cues, 2) understanding multi-party dynamics, and 3) understanding beliefs. Building upon these foundations, we classify and review existing machine learning work on social understanding from the perspectives of verbal, non-verbal, and multimodal social cues. The verbal branch focuses on understanding linguistic signals, covering speaker intent, dialogue sentiment, and commonsense reasoning. The non-verbal branch addresses techniques for perceiving social meaning from visual behaviors such as body gestures, gaze patterns, and facial expressions. The multimodal branch covers approaches that integrate verbal and non-verbal cues to interpret social interactions holistically, for example by recognizing emotions, conversational dynamics, and social situations. By reviewing the scope and limitations of current approaches and benchmarks, we aim to clarify the development trajectory and illuminate the path towards more comprehensive intelligence for social understanding. We hope this survey will spur further research interest and insights into this area.
