Analyzing the Safety of Japanese Large Language Models in Stereotype-Triggering Prompts

Akito Nakanishi, Yukie Sano, Geng Liu, Francesco Pierri

TL;DR

The paper directly assesses the safety of Japanese LLMs by exposing three models (one native Japanese, one English-based, and one Chinese-based) to 3,612 stereotype-triggering prompts built from 301 social groups and 12 templates. It reveals a troubling contrast: the Japanese model has the lowest refusal rate yet produces more toxic and more negative responses than the English-based and Chinese-based models. All three models are sensitive to prompt format. By analyzing refusal rates, toxicity, sentiment, and cross-model correlations, the work shows that safety mechanisms are not language-agnostic and that Japanese prompts can yield biased outputs even in high-accuracy models. The study highlights the need for safety and bias-mitigation strategies tailored to Japanese, and it provides a data-driven foundation for improving AI ethics across languages and cultural contexts.

Abstract

In recent years, Large Language Models (LLMs) have attracted growing interest for their significant potential, though concerns have rapidly emerged regarding unsafe behaviors stemming from inherent stereotypes and biases. Most research on stereotypes in LLMs has relied on indirect evaluation setups, in which models are prompted to select between pairs of sentences associated with particular social groups. Recently, direct evaluation methods have emerged, examining open-ended model responses to overcome limitations of previous approaches, such as annotator biases. Most existing studies have focused on English-centric LLMs, whereas research on non-English models, particularly Japanese, remains sparse, despite the growing development and adoption of these models. This study examines the safety of Japanese LLMs when responding to stereotype-triggering prompts in direct setups. We constructed 3,612 prompts by combining 301 social group terms, categorized by age, gender, and other attributes, with 12 stereotype-inducing templates in Japanese. Responses were analyzed from three foundational models trained on Japanese, English, and Chinese, respectively. Our findings reveal that LLM-jp, a Japanese native model, exhibits the lowest refusal rate and is more likely to generate toxic and negative responses than the other models. Additionally, prompt format significantly influences the output of all models, and the generated responses include exaggerated reactions toward specific social groups, varying across models. These findings underscore the insufficient ethical safety mechanisms in Japanese LLMs and demonstrate that even high-accuracy models can produce biased outputs when processing Japanese-language prompts. We advocate for improving safety mechanisms and bias mitigation strategies in Japanese LLMs, contributing to ongoing discussions on AI ethics beyond linguistic boundaries.
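
The prompt counts in the abstract follow from a full cross of group terms and templates. The sketch below illustrates that arithmetic only. The group terms, template strings, and model labels are hypothetical placeholders, since the actual Japanese terms and templates are not reproduced in this summary.

```python
from itertools import product

# Hypothetical placeholders: the real 301 group terms and 12 Japanese
# templates are not listed in this summary.
groups = [f"group_{i:03d}" for i in range(1, 302)]                   # 301 terms
templates = [f"template_{j:02d}: {{group}}" for j in range(1, 13)]   # 12 templates
models = ["japanese_model", "english_model", "chinese_model"]        # placeholder labels


def build_prompts(groups, templates):
    """Fill every template with every group term (Cartesian product)."""
    return [t.format(group=g) for g, t in product(groups, templates)]


prompts = build_prompts(groups, templates)
n_responses = len(prompts) * len(models)  # one response per prompt per model

print(len(prompts), n_responses)  # 3612 10836
```

With one response per prompt per model, the total of 10,836 matches the figure given in the workflow description.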

Paper Structure

This paper contains 42 sections, 3 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Diagram illustrating the workflow of this work. First, we create a set of stereotype-triggering prompts combining 301 social groups and 12 templates. These are then given to three models, generating 10,836 responses. Finally, the responses are analyzed for refusal rate, toxicity, and sentiment.
  • Figure 2: Bar charts of refusal rates across all models.
  • Figure 3: Bar charts of refusal rates across formats, categories, and subcategories for all models.
  • Figure 4: Distributions of toxicity scores across all models based on responses.
  • Figure 5: Distributions of toxicity scores across formats, categories, and subcategories for all models based on responses.
  • ...and 6 more figures
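
Refusal rates per model, as plotted in Figure 2, amount to a grouped proportion. The following is a minimal sketch assuming each response already carries a boolean refusal label. How the paper assigns those labels is not described in this summary, and the records below are toy values, not results from the paper.

```python
from collections import defaultdict


def refusal_rates(records):
    """Return {model: fraction of responses labeled as refusals}.

    `records` is an iterable of (model_name, is_refusal) pairs; the boolean
    label is assumed to come from an upstream classifier or annotation step.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for model, is_refusal in records:
        totals[model] += 1
        refused[model] += int(is_refusal)
    return {model: refused[model] / totals[model] for model in totals}


# Toy records for illustration only (hypothetical model names and labels).
toy_records = [
    ("model_a", True),
    ("model_a", False),
    ("model_b", False),
    ("model_b", False),
]
rates = refusal_rates(toy_records)
print(rates)
```

The same grouping key could be extended to a (model, template) or (model, category) pair to produce the per-format and per-category breakdowns shown in Figure 3.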