FERGI: Automatic Scoring of User Preferences for Text-to-Image Generation from Spontaneous Facial Expression Reaction

Shuangquan Feng, Junhua Ma, Virginia R. de Sa

TL;DR

FERGI introduces automatic scoring of user preferences for text-to-image generation from spontaneous facial expressions by collecting the FERGI dataset and training FAU-Net to map facial action unit activations to a valence score. The FAU-Net score is shown to complement existing pre-trained human-preference models (ImageReward, PickScore, HPS v2), with integrated scoring achieving up to 68.64% accuracy on image-pair preference tasks, indicating improved alignment with human judgments. The work demonstrates that facial-expression-based signals can provide zero-effort annotation signals to guide fine-tuning of text-to-image models and suggests generalization to other generation tasks, while acknowledging practical limitations such as user camera usage and participant awareness. Overall, FERGI offers a scalable, complementary avenue for capturing user preferences and enhancing perceptual quality in generation systems.

Abstract

Researchers have proposed using human preference feedback data to fine-tune text-to-image generative models. However, the scalability of human feedback collection has been limited by its reliance on manual annotation. Therefore, we develop and test a method to automatically score user preferences from their spontaneous facial expression reactions to the generated images. We collect a dataset of Facial Expression Reaction to Generated Images (FERGI) and show that the activations of multiple facial action units (AUs) are highly correlated with user evaluations of the generated images. We develop an FAU-Net (Facial Action Units Neural Network), which receives inputs from an AU estimation model, to automatically score user preferences for text-to-image generation based on their facial expression reactions. This score is complementary to the pre-trained scoring models, which are based on the input text prompts and generated images. Integrating our FAU-Net valence score with the pre-trained scoring models improves their consistency with human preferences. This method of automatic annotation with facial expression analysis can potentially be generalized to other generation tasks. The code is available at https://github.com/ShuangquanFeng/FERGI, and the dataset is also available at the same link for research purposes.

Paper Structure (39 sections, 9 equations, 18 figures, 3 tables)


Figures (18)

  • Figure 1: Procedure of one session in data collection. Firstly, the participant creates an input text prompt. Secondly, the participant is directed to a webcam preview to confirm that the webcam captures their face appropriately; at the same time, 5 images are generated from the text prompt using Stable Diffusion v1.4. Then, for each image, the participant goes through a 5-second baseline phase, a 5-second image presentation phase, and an image annotation phase with no time restriction. (During the baseline phase, the participant's baseline facial expression is recorded, and during the image presentation phase, the participant's facial expression reaction to the generated image is recorded.) Finally, after all 5 images have been presented and annotated, the participant ranks the 5 generated images from best to worst. A copy of this figure with additional references to the screenshots of each stage is in the supplemental material.
  • Figure 2: Overall ratings highly correlated with AU activation values. The distributions of the activation values of multiple AUs for images of different ratings. Each subfigure shows the results for a different AU (indicated in the captions) and contains 7 boxplots for AU activation values (on the y-axis) for images with 7 different ratings along the x-axis. In each boxplot, the bottom/top of the box represents the first/third quartile (25th/75th percentile) of the AU activation values, and the line in the middle of the box represents the median. Blue bars are used for AUs that are significantly correlated with the ratings while brown bars are used for AUs that are significantly correlated with the extremity of the ratings (computed as $\left|\text{overall rating} - 4\right|$). AU2 and AU12 are shown in blue and brown reflecting significance (after multiple comparisons) for both ratings and extremity of ratings. The darker blue and darker brown colors show the ratings associated with higher AU activations. The numbers of ratings from 1 to 7 are 104, 185, 258, 456, 581, 403, and 216 respectively.
  • Figure 3: Reported emotions highly correlated with AU activation values. The distributions of the activation values of multiple AUs for images eliciting different emotions of the participants as reported in answers to the question "Did you feel any of the following emotions when you saw the image?" Each subfigure shows the results for a different emotion (indicated in the subcaptions) and contains multiple pairs of boxplots representing the results for different AUs. The x-axis represents the indices of the AUs while the y-axis represents the AU activation values in the reaction clip of each image. In each pair of boxplots, the boxplot on the left/right side represents the AU activation values for images that did/didn't elicit the corresponding emotion. In each boxplot, the bottom of the box represents the first quartile (25th percentile) of the AU activation values, the top of the box represents the third quartile (75th percentile), and the line in the middle of the box represents the median. The numbers of responses of each type for each subfigure are as follows. (a) Yes (592); No (1625). (b) Yes (724); No (1493). (c) Yes (352); No (1865). (d) Yes (124); No (2093). (e) Yes (368); No (1849). (f) Yes (194); No (2023).
  • Figure 4: FAU-Net valence score is complementary to the other pre-trained scoring models. Each subfigure shows how annotation accuracy changes when a scoring model (given in the subcaption label) annotates only a selected subset of the data. The x-axis represents the proportion of data selected (image pairs are selected in order of the largest absolute difference between the scores of the two images), and the y-axis represents the annotation accuracy within the selected subset. Each model is good at identifying the image pairs on which it will perform well, as evidenced by higher accuracy at smaller selected proportions. Note that the blue, green, and orange curves almost overlap.
  • Figure 5: Weights of hidden nodes. Subfigures show the weights from the input preprocessed activation values of the 12 AUs to a hidden node in the FAU-Net, with each subcaption giving the weight from that hidden node to the output node (red/green for negative/positive weights). Four hidden nodes with lowest/highest negative/positive weights to the output node are shown here. Full version with all 16 hidden nodes is shown in the supplemental material.
  • ...and 13 more figures
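
The two computations named in the captions of Figures 2 and 4 (the rating extremity, and the accuracy within a subset of image pairs selected by score difference) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names, the argument layout, the rounding of the subset size, and the handling of tied scores are all assumptions made here.

```python
from typing import List, Sequence


def rating_extremity(overall_rating: int, midpoint: int = 4) -> int:
    """Extremity of a 1-7 rating, |overall rating - 4|, as in the Figure 2 caption."""
    return abs(overall_rating - midpoint)


def selective_accuracy(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    human_prefers_a: Sequence[bool],
    proportion: float,
) -> float:
    """Accuracy of a scoring model on the pairs it is most confident about.

    scores_a[i] and scores_b[i] are the model's scores for the two images of
    pair i; human_prefers_a[i] is True when the human preferred image A.
    Pairs are ranked by |score_a - score_b| (largest first) and only the top
    `proportion` of them are evaluated.
    """
    n = len(scores_a)
    if n == 0 or len(scores_b) != n or len(human_prefers_a) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")

    diffs: List[float] = [a - b for a, b in zip(scores_a, scores_b)]

    # Largest absolute difference first; sorted() is stable, so ties keep input order.
    order = sorted(range(n), key=lambda i: -abs(diffs[i]))

    # Assumption: the subset size is rounded to the nearest integer, at least 1.
    k = max(1, round(proportion * n))
    selected = order[:k]

    # Assumption: a zero difference counts as the model preferring image B.
    correct = sum((diffs[i] > 0) == human_prefers_a[i] for i in selected)
    return correct / k


if __name__ == "__main__":
    # Hypothetical scores for four image pairs, for illustration only.
    a = [0.9, 0.2, 0.55, 0.1]
    b = [0.1, 0.8, 0.50, 0.3]
    prefers_a = [True, False, False, True]
    for p in (0.5, 0.75, 1.0):
        print(p, selective_accuracy(a, b, prefers_a, p))
```

Sweeping `proportion` from 1.0 downward and plotting the returned accuracy gives a curve of the kind described in the Figure 4 caption. With the hypothetical data above, the model is right on its two most confident pairs and wrong on the two least confident ones, so accuracy rises as the selected proportion shrinks.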