MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory

Ana Carolina Condez, Diogo Tavares, João Magalhães

TL;DR

MoralCLIP introduces a multimodal embedding space grounded in Moral Foundations Theory to enable moral understanding across vision and language. It combines explicit moral supervision with a data-augmentation pipeline (Visual Moral Compass) that scales the training set to 15,000 image–caption pairs labeled for five moral foundations, and it adds a moral loss to CLIP's objective: $\mathcal{L}_{Total} = \mathcal{L}_{CLIP} + \lambda \cdot \mathcal{L}_{Moral}$. Here $\mathcal{L}_{Moral}$ aligns the semantic similarity of an image–text pair with its moral similarity $\text{sim}_{Moral} = 2\frac{|M_{v_i} \cap M_{t_j}|}{|M_{v_i} \cup M_{t_j}|} - 1$, the Jaccard overlap of the moral label sets $M_{v_i}$ (image) and $M_{t_j}$ (text) rescaled to $[-1, 1]$. Empirical results show that explicit moral supervision (the Augmented variants) yields the strongest cross-modal moral alignment, with substantial improvements in MAP for image-to-text and text-to-image retrieval, and qualitative analyses confirm clearer moral clustering in the embedding space. The work demonstrates that embedding moral foundations into vision–language models is feasible and can yield morally coherent multimodal representations, paving the way for ethically aware AI systems. It also discusses the limitations of weak labeling and cultural bias, and proposes future work on richer label representations and better modeling of inter-foundational relationships.
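
The two formulas in the summary above can be illustrated with a minimal Python sketch. It computes the rescaled Jaccard moral similarity between two label sets and forms the weighted sum of a CLIP loss and a moral loss. The function names, the label strings, and the handling of two empty label sets are assumptions made for illustration; the summary does not specify them, and the exact form of $\mathcal{L}_{Moral}$ is not given here, so it is passed in as a plain number.

```python
from typing import Set


def moral_similarity(labels_image: Set[str], labels_text: Set[str]) -> float:
    """Jaccard overlap of two moral label sets, rescaled from [0, 1] to [-1, 1]."""
    union = labels_image | labels_text
    if not union:
        # Assumption (not stated in the source): two samples with no moral
        # labels are treated as maximally similar.
        return 1.0
    jaccard = len(labels_image & labels_text) / len(union)
    return 2.0 * jaccard - 1.0


def total_loss(clip_loss: float, moral_loss: float, lam: float) -> float:
    """Weighted sum L_Total = L_CLIP + lambda * L_Moral."""
    return clip_loss + lam * moral_loss


# Hypothetical label sets for one image and one caption.
image_labels = {"care", "fairness"}
text_labels = {"care"}
print(moral_similarity(image_labels, text_labels))  # one shared label out of two
print(total_loss(clip_loss=1.0, moral_loss=0.5, lam=0.4))
```

Identical label sets give a similarity of 1, disjoint sets give -1, and partial overlap falls in between, so the target has the same range as cosine similarity.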

Abstract

Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded in the multi-label Social-Moral Image Database (SMID), which we use to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally aware AI systems capable of recognizing and aligning with human moral values.

Paper Structure

This paper contains 29 sections, 5 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Dataset distribution showing moral label frequencies across SMID (2,401 preprocessed samples) and ImageNet (10,602 samples) from our 15,000-sample training set. LAION samples (1,997) are omitted as they contain exclusively neutral moral content.
  • Figure 2: t-SNE visualization of embedding spaces across different models and modalities. Points are colored by moral categories. Note that the moral annotations are multi-label, meaning individual samples can exhibit multiple dimensions simultaneously. Across both image and text embedding spaces, MoralCLIP demonstrates clearer separation between moral categories and better clustering of morally similar content compared to the baseline CLIP model.
  • Figure 3: Image-to-Image retrieval comparison between MoralCLIP and CLIP models on the test set. MoralCLIP retrieves similar images depicting human connection across diverse contexts, while CLIP focuses on low-level visual features like color scheme and formal posing. Similarity scores represent cosine similarity. The moral labels in bold match the query's label. Throughout our figures, values marked with * indicate rounding approximations where actual values differ slightly.
  • Figure 4: Text-to-Image retrieval comparison between MoralCLIP and CLIP models on the test set. Given a query of a handshake scene, MoralCLIP retrieves images depicting moral themes of care and respect, while CLIP mostly retrieves achromatic images with historical elements. ◆ indicates images also retrieved in our image-to-image evaluation, while ◆ indicates same-position retrievals. Similarity scores represent cosine similarity. The moral labels in bold match the query's label.
  • Figure 5: Classification of SMID images into moral categories based on moral valence and relevance scores. Images are categorized as Vice (purple region, negative moral valence), Virtue (orange region, positive moral valence), or Neither (blue region, neutral moral valence). Thresholds are set at moral scores < 2.5 (negative), > 3.5 (positive), and relevance scores < 2.15 (low), > 2.84 (high) to exclude ambiguous boundary cases.
  • ...and 9 more figures