The Social Gaze of LLMs: A Literature Review of Multimodal Approaches to Human Behavior Understanding
Zihan Liu, Parisa Rabbani, Veda Duddu, Kyle Fan, Madison Lee, Yun Huang
TL;DR
The paper conducts a large-scale, interdisciplinary review of 176 studies on LLM-powered multimodal systems for understanding human behavior. It introduces a four-dimensional coding framework and reveals a strong bias toward perception and reasoning via modality-to-text pipelines, with limited interactive social competencies and ethical guidance beyond privacy concerns. It documents a fragmented evaluation landscape dominated by benchmarks and calls for socially grounded, ethically integrated evaluation and broader multimodal fidelity, including norm-sensitive social knowledge. The authors propose a concrete agenda to advance interaction-aware, fair, and transparent multimodal social AI, emphasizing accountability, user-centered design, and the shift from observer to co-creative agents in real-world settings.
Abstract
LLM-powered multimodal systems are increasingly used to interpret human behavior, yet how researchers apply the models' 'social competence' remains poorly understood. This paper presents a systematic literature review of 176 publications across different application domains (e.g., healthcare, education, and entertainment). Using a four-dimensional coding framework (application, technical, evaluative, and ethical), we find (1) frequent use of pattern recognition and information extraction from multimodal sources, but limited support for adaptive, interactive reasoning; (2) a dominant 'modality-to-text' pipeline that privileges language over rich audiovisual cues, stripping away nuanced social cues; (3) evaluation practices reliant on static benchmarks, with socially grounded, human-centered assessments rare; and (4) ethical discussions focused mainly on legal and rights-related risks (e.g., privacy), leaving societal risks (e.g., deception) overlooked or, at best, acknowledged but left unaddressed. We outline a research agenda for evaluating socially competent, ethically informed, and interaction-aware multimodal systems.
