GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing
Hao Lu, Xuesong Niu, Jiyao Wang, Yin Wang, Qingyong Hu, Jiaqi Tang, Yuting Zhang, Kaishen Yuan, Bin Huang, Zitong Yu, Dengbo He, Shuiguang Deng, Hao Chen, Yingcong Chen, Shiguang Shan
TL;DR
The paper evaluates GPT-4V on five visual affective tasks to determine its suitability for affective computing. It finds strong performance in facial action unit (AU) detection and micro-expression detection but weaker general facial-expression recognition, especially without contextual cues. It also investigates higher-level reasoning with Chain-of-Thought prompts and demonstrates how GPT-4V can collaborate with Python tools to perform signal-processing tasks such as heart-rate estimation, pointing to a practical framework for multimodal, agent-assisted analysis. Across datasets such as DISFA, RAF-DB, CASME2, iMiGUE, and Real-Life Trial, the model shows both notable strengths (AU detection and some compound-expression inferences) and clear limitations (subjective emotion judgments, fine-grained micro-expression recognition, deception detection). The authors advocate integrating GPT-4V with task-specific agents and reasoning strategies to build robust affective computing systems, and call for improved data, transfer learning, and sensor fusion to address current gaps. Overall, the work provides a pragmatic roadmap for deploying large multimodal models in emotion-aware applications, with explicit pathways for enhancement and for collaboration with specialized tools.
Abstract
Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. Despite their success in language understanding, it is critical to evaluate their performance on downstream tasks to enable better human-centric applications. This paper assesses MLLMs on five crucial abilities for affective computing, spanning visual affective tasks and reasoning tasks. The results show that GPT-4V achieves high accuracy in facial action unit recognition and micro-expression detection, whereas its general facial expression recognition is less accurate. We also highlight the challenges of fine-grained micro-expression recognition and the potential for further study. Finally, we demonstrate the versatility of GPT-4V in handling advanced tasks in emotion recognition and related fields by integrating it with task-related agents for more complex tasks, such as heart rate estimation through signal processing. In conclusion, this paper provides valuable insights into the potential applications and challenges of MLLMs in human-centric computing. Illustrative examples are available at https://github.com/EnVision-Research/GPT4Affectivity.
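To make the heart-rate example concrete, the sketch below shows the kind of signal-processing tool a task-related agent might call. It is not the authors' code. It assumes a 1-D pulse trace (for example, the mean green-channel intensity of a face region per video frame) has already been extracted and is sampled at `fs` Hz. It estimates heart rate by picking the strongest FFT peak inside a plausible 0.7 to 4 Hz band, a common baseline. The function name `estimate_heart_rate_bpm` and the band limits are illustrative assumptions.

```python
import numpy as np


def estimate_heart_rate_bpm(signal, fs, low_hz=0.7, high_hz=4.0):
    """Hypothetical agent tool: estimate heart rate (bpm) from a pulse trace.

    Assumes `signal` is a remote-PPG-style trace sampled at `fs` Hz.
    Uses simple FFT peak picking, which is a baseline, not the paper's method.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                 # remove the DC component
    x = x * np.hanning(len(x))       # taper the ends to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    # Restrict the search to a plausible heart-rate band (0.7-4 Hz = 42-240 bpm).
    band = (freqs >= low_hz) & (freqs <= high_hz)
    if not band.any():
        raise ValueError("signal too short to resolve the requested band")
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_hz            # convert Hz to beats per minute


# Usage: a synthetic 20 s trace at 30 fps with a 1.2 Hz pulse plus noise.
fs = 30.0
t = np.arange(600) / fs
rng = np.random.default_rng(0)
trace = np.sin(2 * np.pi * 1.2 * t) + 0.2 * rng.standard_normal(t.size)
print(round(estimate_heart_rate_bpm(trace, fs), 1))
```

The frequency resolution is 1 / (trace duration), here 0.05 Hz or 3 bpm, so longer windows or spectral interpolation would be needed for finer estimates.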
