CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

Lavish Bansal, Naman Mishra

TL;DR

CREST introduces a lightweight, universal multilingual safety classifier trained on 13 high-resource languages to cover 100 languages via cluster-guided cross-lingual transfer. By clustering languages in XLM-R embedding space and translating safety data, CREST achieves strong cross-lingual generalization with 0.5B parameters, enabling on-device deployment. Evaluations across six safety benchmarks demonstrate competitive performance with large guardrails and clear advantages over other small models, including robustness to code-switching and cultural contexts. The work highlights the feasibility and importance of language-agnostic safety systems that scale to global multilingual populations.

Abstract

Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving the significant portion of the world's population who communicate in low-resource languages underrepresented. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model uses cluster-based cross-lingual transfer to extend coverage from these few languages to 100, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.

Paper Structure

This paper contains 35 sections, 2 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Demonstrating the critical multilingual safety gap in current guardrails that effectively block harmful content in high-resource languages, but fail for identical queries in low-resource languages.
  • Figure 2: Languages are clustered into 8 groups based on representational similarity derived from XLM-R embeddings. Within each cluster, high-resource languages selected for training are shown in Blue, and low-resource languages used for evaluation are shown in Red.
  • Figure 3: Average F1 scores of Crest-Base and Crest-Large across the six benchmarks described in the experiments section. Scores are reported on both ID and OOD_Low languages.
  • Figure 4: Average F1 performance of models across 15 Indic languages trained on one representative Indic language from each resource category.
  • Figure 5: t-SNE visualization of sentence-level embeddings from 11 languages, which form two visually distinct clusters based on their linguistic similarity. While individual sentence embeddings are shown here for illustrative purposes, clustering across all 100 languages is performed using the mean embedding per language.
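
The captions for Figures 2 and 5 describe clustering languages by the mean XLM-R embedding of each language and drawing training languages from the high-resource members of each cluster. A minimal sketch of that idea follows, under several assumptions: this summary does not state the clustering algorithm, so plain k-means with farthest-point initialisation stands in for it; toy 3-dimensional vectors replace real XLM-R sentence embeddings; and the names `mean_language_embeddings`, `kmeans`, and `training_languages` are hypothetical, not taken from the paper.

```python
import numpy as np


def mean_language_embeddings(sentence_embs):
    """Average the sentence embeddings of each language into one unit vector."""
    langs = sorted(sentence_embs)
    means = np.stack(
        [np.asarray(sentence_embs[lang], dtype=float).mean(axis=0) for lang in langs]
    )
    # L2-normalise so Euclidean distance tracks cosine similarity.
    means /= np.linalg.norm(means, axis=1, keepdims=True)
    return langs, means


def kmeans(points, k, iters=50):
    """Plain k-means (an assumed stand-in for the paper's clustering step)."""
    # Deterministic farthest-point initialisation: start at the first point,
    # then repeatedly add the point farthest from all chosen centres.
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.stack(centers)

    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.stack(
            [
                points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
                for c in range(k)
            ]
        )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Final assignment against the converged centres.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def training_languages(clusters, high_resource):
    """Group the high-resource languages by the cluster they fall in."""
    picks = {}
    for lang, cid in clusters.items():
        if lang in high_resource:
            picks.setdefault(cid, []).append(lang)
    return picks


# Toy data: two groups of languages around two different directions.
rng = np.random.default_rng(0)


def toy(center):
    return np.asarray(center, dtype=float) + 0.05 * rng.normal(size=(20, 3))


sentence_embs = {
    "hi": toy([1, 0, 0]), "mr": toy([1, 0.1, 0]), "bn": toy([1, 0, 0.1]),
    "es": toy([0, 1, 0]), "pt": toy([0.1, 1, 0]), "it": toy([0, 1, 0.1]),
}

langs, means = mean_language_embeddings(sentence_embs)
labels = kmeans(means, k=2)
clusters = {lang: int(label) for lang, label in zip(langs, labels)}
picks = training_languages(clusters, high_resource={"hi", "es"})
print(clusters)
print(picks)
```

In this toy setup the two language groups land in separate clusters, and each cluster contributes one high-resource language for training. With real data, the per-language means would come from XLM-R sentence embeddings and `k` would be 8, matching the number of groups reported in Figure 2.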