GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing

Hao Lu, Xuesong Niu, Jiyao Wang, Yin Wang, Qingyong Hu, Jiaqi Tang, Yuting Zhang, Kaishen Yuan, Bin Huang, Zitong Yu, Dengbo He, Shuiguang Deng, Hao Chen, Yingcong Chen, Shiguang Shan

TL;DR

The paper evaluates GPT-4V on five visual affective tasks to determine its suitability for affective computing, finding strong performance in facial action unit (AU) detection and micro-expressions but weaker general facial-expression recognition, especially without contextual cues. It also investigates higher-level reasoning with Chain-of-Thought prompts and demonstrates how GPT-4V can collaborate with Python tools to perform signal-processing tasks such as heart-rate estimation, hinting at a practical framework for multimodal, agent-assisted analysis. Across datasets such as DISFA, RAF-DB, CASME2, iMiGUE, and Real-Life Trial, the model shows both notable strengths (AU and some compound-expression inferences) and clear limitations (subjective emotion judgments, micro-expression granularity, deception detection). The authors advocate integrating GPT-4V with task-specific agents and reasoning strategies to realize robust affective computing systems, while calling for improved data, transfer learning, and sensor fusion to address current gaps. Overall, the work provides a pragmatic roadmap for deploying large multimodal models in emotion-aware applications with explicit pathways for enhancement and collaboration with specialized tools.

Abstract

Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos. Despite their success in language understanding, it is critical to evaluate their performance on downstream tasks for better human-centric applications. This paper assesses the application of MLLMs to 5 crucial abilities for affective computing, spanning visual affective tasks and reasoning tasks. The results show that GPT-4V achieves high accuracy in facial action unit recognition and micro-expression detection, while its general facial expression recognition is not accurate. We also highlight the challenges of achieving fine-grained micro-expression recognition and the potential for further study, and we demonstrate the versatility and potential of GPT-4V for handling advanced tasks in emotion recognition and related fields by integrating it with task-related agents for more complex tasks, such as heart rate estimation through signal processing. In conclusion, this paper provides valuable insights into the potential applications and challenges of MLLMs in human-centric computing. Our interesting examples are at https://github.com/EnVision-Research/GPT4Affectivity.
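
To make the "heart rate estimation through signal processing" step concrete, the following is a minimal sketch of the kind of Python tool an agent could call. It is not the authors' implementation: the function name, the 0.7 to 4.0 Hz pulse band, and the synthetic input signal are all assumptions chosen for illustration. It takes a one-dimensional pulse-like signal (for example, a per-frame average of skin-pixel intensity) and reports the dominant frequency in the pulse band as beats per minute.

```python
import numpy as np


def estimate_heart_rate_bpm(signal, fs, low_hz=0.7, high_hz=4.0):
    """Estimate heart rate (BPM) as the strongest spectral peak in a pulse band.

    `signal` is a 1-D pulse-like trace sampled at `fs` Hz. The band limits
    are illustrative assumptions (roughly 42 to 240 BPM), not values from the paper.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                      # remove the DC offset
    x = x * np.hanning(len(x))            # taper to reduce spectral leakage
    power = np.abs(np.fft.rfft(x)) ** 2   # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    if not band.any():
        raise ValueError("signal is too short to resolve the requested band")
    peak_hz = freqs[band][np.argmax(power[band])]
    return 60.0 * peak_hz


# Synthetic example (hypothetical data): a 1.2 Hz pulse (72 BPM) sampled at
# 30 frames per second for 20 seconds, with slow drift and noise added.
fs = 30.0
t = np.arange(600) / fs
rng = np.random.default_rng(0)
trace = (
    np.sin(2 * np.pi * 1.2 * t)           # pulse component
    + 2.0 * np.sin(2 * np.pi * 0.1 * t)   # slow illumination drift
    + 0.3 * rng.standard_normal(t.size)   # sensor noise
)
bpm = estimate_heart_rate_bpm(trace, fs)
print(f"Estimated heart rate: {bpm:.1f} BPM")
```

On this synthetic trace the estimate should land close to 72 BPM; with a 20-second window the frequency resolution is 0.05 Hz, so estimates are quantized in steps of 3 BPM. Real video-derived signals would typically need additional steps such as face tracking and detrending, which this sketch omits.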

Paper Structure (13 sections, 10 figures, 1 table)

This paper contains 13 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: The propaganda image was generated by DALL·E 2.
  • Figure 2: Action Unit detection on the DISFA [disfa] dataset. A single dialogue round is used for action unit detection. GPT-4V can accurately identify each AU.
  • Figure 3: Expression recognition on the RAF-DB [shan2018reliable] dataset. GPT-4V cannot achieve good performance on the subjective task of emotion recognition.
  • Figure 4: Compound emotion recognition on the RAF-DB [shan2018reliable] dataset. GPT-4V can deduce objective compound expressions based on contextual information.
  • Figure 5: Micro-expression recognition on the CASME2 [CASME] dataset. GPT-4V has difficulty perceiving small differences in the image directly, so it struggles to recognize micro-expressions accurately.
  • ...and 5 more figures