The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

Nikhil Verma, Manasa Bharadwaj

TL;DR

The paper tackles the problem that alignment-tuned LLMs exhibit English-centric safety biases, potentially compromising multilingual robustness. It systematically analyzes distributional shifts in the embedding space before and after alignment across languages, using PCA, Bhattacharyya distance, and the Silhouette score on seven LLMs, with balanced toxicity datasets and parallel text-detoxification benchmarks as evaluation data. The findings reveal substantial disparities between high-resource and low-resource languages: English shows the strongest alignment effects while other languages lag, underscoring the need for language-specific fine-tuning and broader multilingual safety evaluation. This work motivates the development of truly safe multilingual LLMs through comprehensive multilingual benchmarks and equitable alignment practices for diverse linguistic communities.

Abstract

Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering its impact on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.
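To make the idea of "alignment-induced separation" concrete, the following is a minimal sketch of how such a separation could be quantified, not the authors' implementation. It assumes hidden states for harmless and harmful prompts are already available as arrays; here they are replaced by synthetic Gaussian data. The function names (`pca_project`, `bhattacharyya_gaussian`, `silhouette`, `separation_metrics`), the choice of two principal components, and the Gaussian form of the Bhattacharyya distance are illustrative assumptions.

```python
import numpy as np

def pca_project(X, k=2):
    """Project rows of X onto the top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def bhattacharyya_gaussian(A, B, eps=1e-6):
    """Bhattacharyya distance between Gaussians fitted to point sets A and B.

    Assumption: each class is summarized by its sample mean and covariance.
    """
    d = A.shape[1]
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(A, rowvar=False)) + eps * np.eye(d)
    cov_b = np.atleast_2d(np.cov(B, rowvar=False)) + eps * np.eye(d)
    cov = 0.5 * (cov_a + cov_b)
    diff = mu_a - mu_b
    # Mean term: (1/8) * diff^T cov^{-1} diff
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term: (1/2) * ln(det(cov) / sqrt(det(cov_a) * det(cov_b)))
    _, ld = np.linalg.slogdet(cov)
    _, ld_a = np.linalg.slogdet(cov_a)
    _, ld_b = np.linalg.slogdet(cov_b)
    term_cov = 0.5 * (ld - 0.5 * (ld_a + ld_b))
    return float(term_mean + term_cov)

def silhouette(X, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    classes = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum() - 1  # exclude the point itself
        if n_same == 0:
            scores.append(0.0)
            continue
        a = D[i, same].sum() / n_same  # D[i, i] == 0, so self adds nothing
        b = min(D[i, labels == c].mean() for c in classes if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def separation_metrics(harmless, harmful):
    """Return (Bhattacharyya distance, silhouette) in a 2-D PCA projection."""
    X = np.vstack([harmless, harmful])
    labels = np.array([0] * len(harmless) + [1] * len(harmful))
    Z = pca_project(X, k=2)
    dist = bhattacharyya_gaussian(Z[labels == 0], Z[labels == 1])
    return dist, silhouette(Z, labels)

# Synthetic stand-ins for hidden states (hypothetical data, not model outputs).
rng = np.random.default_rng(0)

def fake_hidden_states(shift, n=200, dim=16):
    harmless = rng.normal(0.0, 1.0, size=(n, dim))
    harmful = rng.normal(0.0, 1.0, size=(n, dim))
    harmful[:, 0] += shift  # larger shift mimics stronger class separation
    return harmless, harmful

before = separation_metrics(*fake_hidden_states(shift=0.2))  # weak separation
after = separation_metrics(*fake_hidden_states(shift=6.0))   # strong separation
print("before:", before)
print("after: ", after)
```

In this toy setup, a larger gap between the two classes yields a larger Bhattacharyya distance and a higher silhouette value. In a real analysis the same two numbers would be compared per language, once for the reference model and once for the aligned model.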

Paper Structure

This paper contains 14 sections, 3 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Effect of Alignment on Hidden Representations in Llama-2 (#7B) for English Prompt Safety.
  • Figure 2: Probing the Impact of Human Preference Tuning on Multilingual Safety at Inference Time: Llama-3.1 (#8B) Alignment in English vs. Hindi
  • Figure 3: Impact of Alignment on Hidden Representations in Llama-2 for Multilingual Corpora.
  • Figure 4: Bhattacharyya Distance for All Models Pre- and Post-Alignment Tuning. Blue radar indicates values before alignment ($\pi_{\text{ref}}$), while green represents values after alignment ($\pi_{\theta}$).
  • Figure 5: Impact of Alignment on Hidden Representations in Llama-2 for Multilingual Parallel Text-Detoxification Corpora.
  • ...and 6 more figures