Towards Social AI: A Survey on Understanding Social Interactions

Sangmin Lee, Minzhi Li, Bolin Lai, Wenqi Jia, Fiona Ryan, Xu Cao, Ozgur Kara, Bikram Boote, Weiyan Shi, Diyi Yang, James M. Rehg

TL;DR

This survey articulates a three-fold framework for social AI: mastering multimodal cues, modeling multi-party dynamics, and encoding beliefs. It systematically organizes literature on verbal, non-verbal, and multimodal social understanding, reviews core tasks (dialogue act and emotion analysis, common sense reasoning, gesture/gaze/facial expression understanding, and multimodal emotion and conversation analysis), and catalogs key datasets. The authors identify gaps in integration, long-range contextual reasoning, cultural and individual differences, and the need for richer benchmarks, proposing concrete directions for alignment, adaptive fusion, memory-augmented reasoning, and belief-aware evaluation. The work aims to guide future research toward robust, context-aware, and culturally competent social intelligence in AI systems, with implications for embodied and virtual agents, chatbots, and interactive systems.

Abstract

Social interactions form the foundation of human societies. Artificial intelligence has made significant progress in certain areas, but enabling machines to seamlessly understand social interactions remains an open challenge. It is important to address this gap by endowing machines with social capabilities. We identify three key capabilities needed for effective social understanding: 1) understanding multimodal social cues, 2) understanding multi-party dynamics, and 3) understanding beliefs. Building upon these foundations, we classify and review existing machine learning works on social understanding from the perspectives of verbal, non-verbal, and multimodal social cues. The verbal branch focuses on understanding linguistic signals such as speaker intent, dialogue sentiment, and commonsense reasoning. The non-verbal branch addresses techniques for perceiving social meaning from visual behaviors such as body gestures, gaze patterns, and facial expressions. The multimodal branch covers approaches that integrate verbal and non-verbal multimodal cues to holistically interpret social interactions such as recognizing emotions, conversational dynamics, and social situations. By reviewing the scope and limitations of current approaches and benchmarks, we aim to clarify the development trajectory and illuminate the path towards more comprehensive intelligence for social understanding. We hope this survey will spur further research interest and insights into this area.

Paper Structure

This paper contains 65 sections, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: Dynamics of social interactions, illustrating three key capabilities: multimodal understanding, multi-party modeling, and belief awareness. Machines need to be equipped with these capabilities to effectively interpret social meaning regarding intentions, emotions, and situational contexts in social interactions.
  • Figure 2: Taxonomy of existing research on social understanding organized according to social cue types, such as verbal and non-verbal cues. The taxonomy covers studies on linguistic understanding from dialogues, visual perception of non-verbal behaviors, and joint understanding of verbal and non-verbal cues.
  • Figure 3: Examples of dialogue act analysis from the HOPE dataset (Malhotra et al., 2022). Dialogue act analysis involves classifying the speaker's intention or communicative goal behind an utterance.
  • Figure 4: Examples of dialogue emotion analysis from the EmotionLines dataset (Hsu et al., 2018). Dialogue emotion analysis infers the emotion conveyed in an utterance from the surrounding dialogue context.
  • Figure 5: Examples of common sense reasoning from the SocialIQA dataset (Sap et al., 2019). Common sense reasoning includes understanding the causes and consequences of social events.
  • ...and 6 more figures