Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies

Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng

TL;DR

This work addresses the risk of symbolic harm in large vision-language models (LVLMs) by introducing CROSS, a culturally grounded, multilingual benchmark of visually grounded queries spanning 16 countries and 14 languages. It pairs CROSS with CROSS-Eval, a theory-based framework that draws on the Intercultural Sensitivity Scale to measure four dimensions: awareness, education, compliance, and helpfulness. Empirical results reveal substantial cultural-safety gaps across 21 LVLMs, with reasoning and scaling offering limited, inconsistent improvements and proprietary models outperforming open-source counterparts. The authors then present two alignment strategies, Safety-SFT and Safety-DPO, that substantially boost cultural awareness and compliance with minimal damage to general multimodal performance, highlighting a promising path toward culturally safe LVLMs in global applications.

Abstract

Large vision-language models (LVLMs) are increasingly deployed in globally distributed applications, such as tourism assistants, yet their ability to produce culturally appropriate responses remains underexplored. Existing multimodal safety benchmarks primarily focus on physical safety and overlook violations rooted in cultural norms, which can result in symbolic harm. To address this gap, we introduce CROSS, a benchmark designed to assess the cultural safety reasoning capabilities of LVLMs. CROSS includes 1,284 multilingual visually grounded queries from 16 countries, three everyday domains, and 14 languages, where cultural norm violations emerge only when images are interpreted in context. We propose CROSS-Eval, an intercultural theory-based framework that measures four key dimensions: cultural awareness, norm education, compliance, and helpfulness. Using this framework, we evaluate 21 leading LVLMs, including mixture-of-experts models and reasoning models. Results reveal significant cultural safety gaps: the best-performing model achieves only 61.79% in awareness and 37.73% in compliance. While some open-source models reach GPT-4o-level performance, they still fall notably short of proprietary models. Our results further show that increasing reasoning capacity improves cultural alignment but does not fully resolve the issue. To improve model performance, we develop two enhancement strategies: supervised fine-tuning with culturally grounded, open-ended data and preference tuning with contrastive response pairs that highlight safe versus unsafe behaviors. These methods substantially improve GPT-4o's cultural awareness (+60.14%) and compliance (+55.2%), while preserving general multimodal capabilities with minimal performance reduction on general multimodal understanding benchmarks.
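To make the four reported dimensions concrete, the sketch below shows one way per-response verdicts could be aggregated into percentage scores like the awareness and compliance figures quoted above. It is an illustration only: the `Judgment` record, the binary (pass/fail) verdict per dimension, and the `dimension_rates` helper are assumptions introduced here, not the scoring procedure defined by CROSS-Eval.

```python
from dataclasses import dataclass
from typing import Iterable

# The four dimensions named in the abstract.
DIMENSIONS = ("awareness", "education", "compliance", "helpfulness")


@dataclass(frozen=True)
class Judgment:
    """Hypothetical per-response verdicts (assumed binary, one per dimension)."""
    awareness: bool
    education: bool
    compliance: bool
    helpfulness: bool


def dimension_rates(judgments: Iterable[Judgment]) -> dict[str, float]:
    """Return, for each dimension, the percentage of responses judged positive."""
    items = list(judgments)
    if not items:
        raise ValueError("need at least one judgment")
    return {
        dim: round(100.0 * sum(getattr(j, dim) for j in items) / len(items), 2)
        for dim in DIMENSIONS
    }


if __name__ == "__main__":
    # Toy verdicts for four model responses (made-up values, not paper data).
    sample = [
        Judgment(True, True, False, True),
        Judgment(True, False, False, True),
        Judgment(False, False, True, True),
        Judgment(False, False, False, False),
    ]
    print(dimension_rates(sample))
```

Keeping the dimensions separate, rather than collapsing them into one score, mirrors how the abstract reports awareness and compliance as distinct numbers.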

Paper Structure

This paper contains 59 sections, 1 equation, 23 figures, 13 tables.

Figures (23)

  • Figure 1: An example from the CROSS benchmark and the multi-dimensional evaluation framework CROSS-Eval.
  • Figure 2: Example from CROSS-Country illustrating the data pipeline: starting with a cultural norm, we identify related objects (e.g., wine glasses), retrieve a relevant image, and create a vision-grounded query that appears neutral on its own but implies a norm violation when combined with the image.
  • Figure 3: Multi-dimensional categorization of data in CROSS.
  • Figure 4: Safety data construction by re-purposing the CVQA dataset.
  • Figure 5: A culturally grounded safety evaluation example from CROSS. This scenario illustrates how a multimodal model must understand cultural norms to avoid generating harmful or inappropriate suggestions. The user requests a practical gift for a baby’s hundred-day banquet in China, which is a significant celebratory occasion. Although many of the items shown are visually suitable, recommending any of the clocks would conflict with a key cultural taboo. In Chinese culture, gifting clocks is linked to the concept of death and is considered inauspicious due to its phonetic similarity to the phrase "sending off to the end." This example highlights the importance of multimodal models reasoning over both visual content and cultural context to ensure culturally appropriate behavior.
  • ...and 18 more figures