Emotion Knowledge Enhancement for Vision Large Language Models: A Self-Verification Approach for High-Quality Emotion Instruction Data Generation

Feifan Wang, Tengfei Song, Minggui He, Chang Su, Zhanglin Wu, Hao Yang, Wenming Zheng, Osamu Yoshie

TL;DR

This work tackles the challenge of obtaining high-quality, multi-granular emotion annotations for vision-language models by introducing SEKE, a self-verification framework that embeds human emotion knowledge and uses uncertainty-aware verification to generate robust instruction data. The approach produces FEID, a dataset with coarse- and fine-grained descriptions and their correlations, and FEAB, a dedicated benchmark, enabling more effective fine-tuning of VLLMs for facial emotion analysis. Empirical results show SEKE-tuned models outperform state-of-the-art baselines on expression recognition, AU detection, and valence-arousal estimation, with notable gains over GPT-4V/4o-driven data generation. By combining prior knowledge with adaptive uncertainty-driven sampling, SEKE reduces labeling costs while delivering reliable, comprehensive emotion descriptions, suggesting practical impact for improved human-machine interaction and multimodal affective reasoning.

Abstract

Facial emotion perception in vision large language models (VLLMs) is crucial for achieving natural human-machine interaction. However, creating high-quality annotations for both coarse- and fine-grained facial emotion analysis demands costly expertise. The lack of such high-quality instruction data limits the performance of VLLMs in facial emotion perception. To address this, we propose a self-verification approach with emotion knowledge enhancement (SEKE), which cost-effectively generates high-quality instruction data for multi-grained emotion analysis using a closed-source VLLM. The approach integrates prior human knowledge into VLLM inference, guided by the inherent correlations among three granularity levels of emotion description (discrete expression, valence-arousal, and action unit), to reliably generate comprehensive annotations. A self-verification strategy with Uncertainty-Aware Monte Carlo sampling (SV-UAMC) is also embedded to efficiently extract more accurate VLLM predictions, further improving annotation reliability. With this approach, we construct a facial emotion instruction dataset (FEID) containing three comprehensive descriptions, which provides coarse- and fine-grained emotional information for effective model training. Additionally, we introduce a facial emotion analysis benchmark (FEAB) to measure a VLLM's ability on these tasks. Our method significantly outperforms state-of-the-art methods on three downstream facial emotion analysis tasks.
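
The abstract does not describe how SV-UAMC works internally. As a rough illustration of the general idea of uncertainty-aware sampling only (not the authors' algorithm), the Python sketch below queries a stand-in annotator several times, measures disagreement among the sampled labels with Shannon entropy, and draws extra samples only while that disagreement stays high. The names `vote_entropy`, `adaptive_vote`, and `sample_fn`, as well as all thresholds, are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def vote_entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    if len(counts) <= 1:
        return 0.0  # unanimous (or empty) votes carry no uncertainty
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def adaptive_vote(
    sample_fn: Callable[[], str],   # hypothetical: one stochastic model query
    min_samples: int = 3,           # assumed initial sampling budget
    max_samples: int = 9,           # assumed hard cap on queries
    entropy_threshold: float = 0.5  # assumed "confident enough" level
) -> Tuple[str, float, int]:
    """Sample until the votes agree well enough or the budget is exhausted.

    Returns (majority label, final entropy, number of samples drawn).
    """
    labels = [sample_fn() for _ in range(min_samples)]
    while vote_entropy(labels) > entropy_threshold and len(labels) < max_samples:
        labels.append(sample_fn())
    majority, _ = Counter(labels).most_common(1)[0]
    return majority, vote_entropy(labels), len(labels)


if __name__ == "__main__":
    # A perfectly consistent annotator stops after the initial samples.
    print(adaptive_vote(lambda: "happy"))
```

In this sketch, easy inputs cost only `min_samples` queries, while ambiguous ones receive more, which is one plausible way to trade annotation cost against reliability.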

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: The overall architecture of the proposed self-verification approach with emotion knowledge enhancement (SEKE) to generate high-quality instruction data for facial emotion analysis.
  • Figure 2: The VLLM architecture used to train our model on the proposed facial emotion instruction dataset.
  • Figure 3: Comparison of performance with complete versus incomplete emotion descriptions. Metrics are expression accuracy, AU average F1 score, and 1-MAE for valence/arousal on FEAB. SEKE (only one description) denotes the model fine-tuned with only the description label corresponding to the current task.
  • Figure 4: Comparison of the reliability of annotations generated for missing labels. Metrics are expression accuracy, AU average F1 score, and 1-MAE for valence/arousal on Aff-Wild2.
  • Figure 5: Two examples comparing the comprehensive emotional reasoning of the SEKE model with that of models fine-tuned on other instruction datasets. Incorrect inferences are marked in blue and correct ones in red. For valence and arousal estimates, a prediction is considered correct if it is within 0.2 of the ground truth.
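
The valence/arousal measures named in the captions above can be illustrated with a short Python sketch. It assumes that MAE is the usual mean absolute error and that "within 0.2" is an inclusive bound on the absolute error; the function names `one_minus_mae` and `within_tolerance` are hypothetical and not taken from the paper.

```python
from typing import Sequence


def one_minus_mae(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Return 1 - MAE, so that higher is better (as plotted in Figures 3 and 4)."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and of equal length")
    mae = sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)
    return 1.0 - mae


def within_tolerance(pred: float, truth: float, tol: float = 0.2) -> bool:
    """Figure 5 style check: correct if the absolute error is at most tol.

    Treating the bound as inclusive is an assumption of this sketch.
    """
    return abs(pred - truth) <= tol


if __name__ == "__main__":
    print(one_minus_mae([0.5, -0.5], [0.0, 0.0]))  # MAE is 0.5, so this prints 0.5
    print(within_tolerance(0.5, 0.35))             # error 0.15 -> True
    print(within_tolerance(0.5, 0.1))              # error 0.40 -> False
```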