Human-Centred Evaluation of Text-to-Image Generation Models for Self-expression of Mental Distress: A Dataset Based on GPT-4o

Sui He, Shenbin Qian

TL;DR

This study investigates whether AI-generated images can support self-expression of mental distress among Chinese international students in the UK. It follows a three-stage methodology: collecting authentic descriptions of distress, generating images with GPT-4o using four prompt templates, and evaluating the images with both human judgments and automatic metrics. The publicly released dataset comprises 100 descriptions, 400 images, and the corresponding human evaluation scores, offering a resource for image evaluation, RLHF, and multi-modal mental health research. Findings show that prompt design meaningfully affects perceived helpfulness: the illustrator persona delivers the strongest overall performance, while automatic semantic similarity metrics align poorly with human judgments. The work highlights the value of human-centred evaluation in sensitive domains and lays groundwork for broader multi-modal research on mental health communication.

Abstract

Effective communication is central to achieving positive healthcare outcomes in mental health contexts, yet international students often face linguistic and cultural barriers that hinder their communication of mental distress. In this study, we evaluate the effectiveness of AI-generated images in supporting self-expression of mental distress. To achieve this, twenty Chinese international students studying at UK universities were invited to describe their personal experiences of mental distress. These descriptions were elaborated using GPT-4o with four persona-based prompt templates rooted in contemporary counselling practice to generate corresponding images. Participants then evaluated the helpfulness of generated images in facilitating the expression of their feelings based on their original descriptions. The resulting dataset comprises 100 textual descriptions of mental distress, 400 generated images, and corresponding human evaluation scores. Findings indicate that prompt design substantially affects perceived helpfulness, with the illustrator persona achieving the highest ratings. This work introduces the first publicly available text-to-image evaluation dataset with human judgment scores in the mental health domain, offering valuable resources for image evaluation, reinforcement learning with human feedback, and multi-modal research on mental health communication.

Paper Structure

This paper contains 32 sections and 5 figures.

Figures (5)

  • Figure 1: An example from our dataset: a textual description, the images generated by GPT-4o, and their evaluation scores. The participant described a sad experience in Chinese using a metaphor. We used the four prompt templates to ask GPT-4o to generate images, which the participant then rated for helpfulness using categorical scores.
  • Figure 2: Degree of helpfulness of different prompt templates
  • Figure 3: Total helpfulness scores for each prompt template
  • Figure 4: The number of "best" images selected by participants for the four prompt templates
  • Figure 5: Similarity scores calculated using BLIP-2 embeddings for images generated based on the four prompt templates
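
The embedding-based similarity scores in Figure 5 can be illustrated with a minimal sketch. It rests on several assumptions that the text above does not confirm: that each description and each image is represented as an embedding vector, that the two are compared with cosine similarity, and that scores are averaged per prompt template. The toy vectors stand in for real BLIP-2 embeddings, the function names are hypothetical, and `template_b` is a placeholder name (only the illustrator persona is named in the source).

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_similarity_by_template(records):
    """Average text-image similarity for each prompt template.

    records: iterable of (template_name, text_embedding, image_embedding).
    This grouping is an assumption made for illustration only.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for template, text_vec, image_vec in records:
        sums[template] += cosine_similarity(text_vec, image_vec)
        counts[template] += 1
    return {template: sums[template] / counts[template] for template in sums}


# Toy 2-D vectors standing in for real embeddings (hypothetical data).
toy_records = [
    ("illustrator", [1.0, 0.0], [1.0, 0.0]),
    ("illustrator", [1.0, 0.0], [0.0, 1.0]),
    ("template_b", [1.0, 1.0], [1.0, 1.0]),
]

if __name__ == "__main__":
    print(mean_similarity_by_template(toy_records))
```

Per-template averages of this kind could then be set against the human helpfulness scores to check how far an automatic metric tracks participants' judgments.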