The Social Gaze of LLMs: A Literature Review of Multimodal Approaches to Human Behavior Understanding

Zihan Liu, Parisa Rabbani, Veda Duddu, Kyle Fan, Madison Lee, Yun Huang

TL;DR

The paper conducts a large-scale, interdisciplinary review of 176 studies on LLM-powered multimodal systems for understanding human behavior. It introduces a four-dimensional coding framework and reveals a strong bias toward perception and reasoning via modality-to-text pipelines, with limited interactive social competencies and ethical guidance beyond privacy concerns. It documents a fragmented evaluation landscape dominated by benchmarks and calls for socially grounded, ethically integrated evaluation and broader multimodal fidelity, including norm-sensitive social knowledge. The authors propose a concrete agenda to advance interaction-aware, fair, and transparent multimodal social AI, emphasizing accountability, user-centered design, and the shift from observer to co-creative agents in real-world settings.

Abstract

LLM-powered multimodal systems are increasingly used to interpret human behavior, yet how researchers apply the models' 'social competence' remains poorly understood. This paper presents a systematic literature review of 176 publications across different application domains (e.g., healthcare, education, and entertainment). Using a four-dimensional coding framework (application, technical, evaluative, and ethical), we find (1) frequent use of pattern recognition and information extraction from multimodal sources, but limited support for adaptive, interactive reasoning; (2) a dominant 'modality-to-text' pipeline that privileges language over rich audiovisual cues, stripping away nuanced social cues; (3) evaluation practices reliant on static benchmarks, with socially grounded, human-centered assessments rare; and (4) ethical discussions focused mainly on legal and rights-related risks (e.g., privacy), leaving societal risks (e.g., deception) overlooked or, at best, acknowledged but left unaddressed. We outline a research agenda for evaluating socially competent, ethically informed, and interaction-aware multimodal systems.

Paper Structure

This paper contains 32 sections and 8 figures.

Figures (8)

  • Figure 1: Workflow of the Literature Search and Selection Process. This flowchart illustrates the four-stage methodology used to identify relevant papers. The process began with an initial pool of 1,350 papers from four databases, which was narrowed down based on inclusion and exclusion criteria, ultimately resulting in a final corpus of 176 papers for analysis.
  • Figure 2: A Conceptual Framework for Analyzing Social Intelligence (RQ1). This diagram outlines the multi-dimensional coding framework used to analyze how social intelligence is applied in LLM-powered systems. It deconstructs the analysis into four key areas: the system's core Social Intelligence Functions, the Social Context it takes as input, the Interaction Context of its output, and the specific Social Intelligence Competencies it demonstrates.
  • Figure 3: Four Major Categories of Behavioral Cues Analyzed in Reviewed Systems. The figure displays the primary types of human behavioral cues that the surveyed systems are designed to interpret. These include verbal and language cues, body and motion cues, vocal and auditory cues, and facial and gaze cues.
  • Figure 4: Analysis of Social Behavior Along Three Key Dimensions. This Sankey diagram visualizes the relationships between the interaction structure, social units, and temporal scale in the reviewed literature. The flow highlights a dominant research focus on analyzing individual behaviors ("single person") at a micro-level ("signals," "behaviors & acts") over brief timescales ("moment," "short-term").
  • Figure 5: Distribution of Social Intelligence Competencies Across Application Domains. This radar chart compares the implementation of six key social intelligence competencies across five major application domains. The chart shows that competencies like social perception and reasoning are nearly universal, while social creativity and social interaction are far less common, particularly in domains like Security & Surveillance.
  • ...and 3 more figures