Visual Prompting in LLMs for Enhancing Emotion Recognition

Qixuan Zhang, Zhifeng Wang, Dylan Zhang, Wenjia Niu, Sabrina Caldwell, Tom Gedeon, Yang Liu, Zhenyue Qin

TL;DR

A novel Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely and improves accuracy in face count and emotion categorization while preserving the enriched image context.

Abstract

Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing. Nonetheless, the potential of visual prompts for emotion recognition in these models remains largely unexplored. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. To address this problem, we propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face count and emotion categorization while preserving the enriched image context. Through extensive experiments on and analysis of recent commercial and open-source VLLMs, we evaluate how well SoV enables these models to comprehend facial expressions in natural environments. Our findings demonstrate the effectiveness of integrating spatial visual prompts into VLLMs for improving emotion recognition performance.

Paper Structure

This paper contains 26 sections, 9 equations, 15 figures, 4 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Proposed Set-of-Vision (SoV) prompting approach for enhancing facial expression recognition in Vision Large Language Models (VLLMs). SoV progressively incorporates (1) bounding boxes to identify and locate faces, (2) numbered boxes to ground and differentiate faces, and (3) facial landmarks to analyze spatial relationships for fine-grained emotion classification. This multi-stage visual prompting strategy enables VLLMs to accurately detect and recognize emotions in real-world images while preserving global context.
  • Figure 2: Comparative analysis of emotion recognition methods in a group setting: assessing the precision of facial emotion categorization and face detection using plain text prompts versus Set-of-Vision (SoV) prompts incorporating facial landmarks, bounding boxes, and face enumeration. Top: Results using plain text prompts. Bottom: Results using Set-of-Vision (SoV) prompts. The use of SoV prompts, such as numbering each face, placing bounding boxes, and identifying facial landmarks, allows for a more precise analysis.
  • Figure 3: Workflow diagram for enhanced face recognition and emotion analysis using the Set-of-Vision (SoV) prompting approach: a multi-step process involving face detection, face numbering, landmark extraction, and spatial relationship analysis for emotion classification. Each detected face is identified and analyzed through its facial landmarks, such as the positions of the nose, eyes, mouth, and other facial features.
  • Figure 4: Face detection inevitably introduces some overlaps or conflicts that confuse VLLMs. This figure analyzes the impact of face overlaps, occlusions, landmark misalignment, and bounding box conflicts on emotion recognition.
  • Figure 5: We use two types of prompt methods. Left: plain text prompts, which can be used for group emotion recognition. Right: combined text-vision prompts, which can be used for analyzing specific individuals' emotions. These prompts can be used to evaluate emotional interpretation in social interactions based on facial expressions, body language, and contextual cues.
  • ...and 10 more figures
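
The captions for Figures 1, 3, and 4 describe a pipeline of face detection, face numbering, landmark marking, and handling of overlapping boxes. The following is a minimal sketch of how such an overlay plan might be assembled, not the authors' implementation. The `Face` record, the left-to-right numbering order, the `conflict_iou` threshold of 0.3, and the prompt wording are all assumptions made for illustration; the detector that produces the boxes and landmarks is assumed to exist upstream, and actual drawing onto the image is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[int, int]


@dataclass(frozen=True)
class Face:
    """Output of a hypothetical upstream face detector."""
    box: Box
    landmarks: Tuple[Point, ...] = ()


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_overlay(faces: Sequence[Face], conflict_iou: float = 0.3) -> List[Dict]:
    """Number faces left to right and list the marks to draw for each one.

    conflict_iou is an assumed threshold for flagging overlapping boxes.
    """
    ordered = sorted(faces, key=lambda f: (f.box[0], f.box[1]))
    marks = []
    for number, face in enumerate(ordered, start=1):
        marks.append({
            "id": number,                      # label drawn next to the box
            "box": face.box,                   # bounding box to draw
            "landmarks": list(face.landmarks), # points to draw on the face
            "conflicts": [],                   # ids of overlapping faces
        })
    # Flag pairs of boxes that overlap enough to risk confusing the model.
    for i in range(len(marks)):
        for j in range(i + 1, len(marks)):
            if iou(marks[i]["box"], marks[j]["box"]) > conflict_iou:
                marks[i]["conflicts"].append(marks[j]["id"])
                marks[j]["conflicts"].append(marks[i]["id"])
    return marks


def build_text_prompt(marks: Sequence[Dict], target_id: int,
                      labels: Sequence[str]) -> str:
    """Compose the text half of a combined text-vision prompt."""
    if target_id not in {m["id"] for m in marks}:
        raise ValueError(f"no face numbered {target_id}")
    return (
        f"The image shows {len(marks)} faces, each marked with a numbered box "
        f"and facial landmarks. Which of these emotions best describes face "
        f"{target_id}: {', '.join(labels)}?"
    )
```

Because the numbers refer to marks drawn on the full image rather than to cropped faces, the model still sees the whole scene, which is the context-preserving property the captions emphasize. Flagging conflicts instead of silently dropping faces is one possible policy; the source does not say how overlaps are resolved.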