Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models

Fabio Pernisi; Dirk Hovy; Paul Röttger

TL;DR

The paper investigates how many-shot jailbreaking affects the safety of LLMs prompted in Italian, addressing a gap in multilingual safety research. It introduces a dataset of unsafe Italian question-answer pairs derived from SST and SR, built by translating and augmenting prompts, and evaluates six open-weight models from four families using two evaluation methods: normalized negative log likelihood (NLL) and a GPT-4-based safety classifier. Unsafe behavior increases with the number of demonstrations: on average, the share of unsafe completions rises from 68% to 84% as shots are added. Resilience differs across models, which the paper links to multilingual design. The work stresses the urgent need for cross-lingual safety measures, provides a public dataset and codebase, and calls for further multilingual, scale-aware safety evaluations across diverse model architectures.

Abstract

As diverse linguistic communities and users adopt large language models (LLMs), assessing their safety across languages becomes critical. Despite ongoing efforts to make LLMs safe, they can still be made to behave unsafely through jailbreaking, a technique in which models are prompted to act outside their operational guidelines. Research on LLM safety and jailbreaking, however, has so far mostly focused on English, limiting our understanding of LLM safety in other languages. We contribute towards closing this gap by investigating the effectiveness of many-shot jailbreaking, where models are prompted with unsafe demonstrations to induce unsafe behaviour, in Italian. To enable our analysis, we create a new dataset of unsafe Italian question-answer pairs. With this dataset, we identify clear safety vulnerabilities in four families of open-weight LLMs. We find that the models exhibit unsafe behaviour even when prompted with few unsafe demonstrations, and, more alarmingly, that this tendency rapidly escalates with more demonstrations.

Paper Structure (11 sections, 1 equation, 3 figures)

Introduction
Experimental Setup
Dataset
Models
Evaluation Methods
Negative Log Likelihood
Model Response
Results
Discussion
Conclusion
System Prompt for GPT-4 Classifier

Figures (3)

Figure 1: Many-Shot Jailbreaking in Italian is an attack setup in which we prompt an LLM with up to 64 Italian-language demonstrations of unsafe questions ('DOMANDA:') and compliant answers ('RISPOSTA:') to induce unsafe behavior.
Figure 2: Effectiveness of many-shot jailbreaking in Italian based on model response safety: Percentage of unsafe responses for all models in the Models section relative to the number of malicious demonstrations in the input text. The proportion of unsafe completions is high even for very few shots in the Mistral 7B, Llama3 8B, and Gemma models. For the Qwen models, instead, the impact of additional shots is more pronounced.
Figure 3: Effectiveness of many-shot jailbreaking in Italian based on negative log likelihood. Lower negative log likelihood indicates worse model safety. Dots represent the actual average values, while shaded areas represent the 95% confidence interval obtained via bootstrapping with 1,000 samples.
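
The confidence intervals described in the Figure 3 caption can be illustrated with a minimal sketch of a bootstrap over per-prompt negative log likelihood values. The caption only states that bootstrapping with 1,000 samples was used; the percentile method, the function name `bootstrap_ci`, and the example values below are assumptions made for illustration, not the paper's code or data.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Assumption: a plain percentile bootstrap; the paper's exact
    procedure is not specified in the caption.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = resampled_means[int(alpha * n_resamples)]
    upper_index = min(n_resamples - 1, int((1.0 - alpha) * n_resamples))
    upper = resampled_means[upper_index]
    return mean(values), lower, upper


# Hypothetical per-prompt NLL values for one model at one shot count.
example_nll = [2.1, 1.8, 2.4, 1.9, 2.0, 2.3, 1.7, 2.2]
point, lower, upper = bootstrap_ci(example_nll)
print(f"mean NLL = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Running this once per model and shot count would give the dots (the point estimates) and shaded bands (the interval bounds) of a plot like Figure 3.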
