Visual moral inference and communication

Warren Zhu, Aida Ramezani, Yang Xu

TL;DR

This work addresses the challenge of extracting fine-grained moral judgments from visual inputs by proposing a multimodal framework that fuses image representations with caption-derived text. Using the Socio-Moral Image Database (SMID) for supervised learning and the GoodNews NYTimes image collection for analysis of news visual communication, the authors demonstrate that joint image–text embeddings (notably CLIP-based) outperform text-only models, achieving an average $R^2 = 0.6320$ in predicting human moral ratings. The approach reveals systematic regional and category-specific patterns in moral signaling within public news imagery, highlighting implicit biases across regions and moral foundations. The results underscore the value of multimodal inference for automatic visual moral understanding and set the stage for extending to additional modalities and broader media analyses.

Abstract

Humans can make moral inferences from multiple sources of input. In contrast, automated moral inference in artificial intelligence typically relies on language models with textual input. However, morality is conveyed through modalities beyond language. We present a computational framework that supports moral inference from natural images, demonstrated in two related tasks: 1) inferring human moral judgment toward visual images and 2) analyzing patterns in moral content communicated via images from public news. We find that models based on text alone cannot capture the fine-grained human moral judgment toward visual stimuli, but language-vision fusion models offer better precision in visual moral inference. Furthermore, applications of our framework to news data reveal implicit biases in news categories and geopolitical discussions. Our work creates avenues for automating visual moral inference and discovering patterns of visual moral communication in public media.
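To make the idea of language-vision fusion concrete, the following is a minimal sketch of one simple way to combine image and caption embeddings and score the result with $R^2$. It is not the authors' implementation: the concatenation-based fusion, the closed-form ridge regressor, the embedding sizes, and the helper names (`fuse`, `fit_ridge`, `predict`, `r_squared`) are all assumptions made for illustration, and the embeddings and ratings are random synthetic stand-ins rather than SMID data or real CLIP outputs.

```python
import numpy as np

def fuse(image_emb, text_emb):
    """Concatenate L2-normalized image and text embeddings row by row.
    (Assumed fusion scheme for illustration only.)"""
    def l2(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    return np.concatenate([l2(image_emb), l2(text_emb)], axis=1)

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = alpha * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # do not shrink the intercept term
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)

def predict(X, w):
    """Apply the fitted weights (last entry is the intercept)."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins for image embeddings, caption embeddings, and ratings.
rng = np.random.default_rng(0)
n, d_img, d_txt = 200, 16, 16
image_emb = rng.normal(size=(n, d_img))
text_emb = rng.normal(size=(n, d_txt))
X = fuse(image_emb, text_emb)
true_w = rng.normal(size=X.shape[1])
ratings = X @ true_w + 0.1 * rng.normal(size=n)  # hypothetical "moral ratings"

# Fit on the first 150 items and evaluate on the remaining 50.
train, test = slice(0, 150), slice(150, n)
w = fit_ridge(X[train], ratings[train])
score = r_squared(ratings[test], predict(X[test], w))
print(f"held-out R^2 on synthetic data: {score:.3f}")
```

In a real setting, the two embedding matrices would come from a pretrained encoder applied to each image and its caption, and the same held-out $R^2$ comparison could be run on text-only features versus fused features.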

Paper Structure

This paper contains 11 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A photograph of a soldier hugging a child, taken from the Socio-Moral Image Database used for analysis and accompanied by a caption generated by Azure AI. The text-based and image-based representations of the image and its caption are both highly related to the moral foundation concerning Care, with the image-based representation giving a more accurate estimate of the image's moral content as measured against the ground-truth human ratings.
  • Figure 2: An illustration of our image-text fusion framework for visual moral inference and communication. Top plot: Evaluation of different text and image representations of the input figures used to train computational models for moral inference. Bottom plot: Applying the text-image fusion model to uncover implicit patterns of visual moral communication in news media.
  • Figure 3: The predicted Morality scores of images corresponding to each regional category. Within each cell, the mean Morality score is shown in bold at the top, the standard error of the mean is in parentheses in the middle, and the number of images matching each year/category is in square brackets at the bottom.
  • Figure 4: The mean predicted relevance to different moral foundations across all years for the news categories of interest. A rating of 1 indicates that an image is unrelated to the moral foundation, while a rating of 5 indicates that an image is highly related to the moral foundation. The number of images in each category can be found in the top right corner. Error bars indicate the standard error of the mean. Captions of sample images with high moral relevance are shown above the corresponding bars.