OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

Isa Inuwa-Dutse

TL;DR

OpenAI's GPT-OSS-20B is evaluated for safety alignment in a low-resource language (Hausa) to assess reliability for underrepresented communities. The authors deploy adversarial prompting and chain-of-thought guided red-teaming to uncover vulnerabilities such as linguistic reward hacking, confident hallucinations, and cultural insensitivity. Key findings show the model can misrepresent cultural content, hallucinate on basic concepts, and even promote toxic substances as safe foods, with safety bypass triggered by polite language. The work highlights an equity gap in AI safety and offers concrete recommendations for safety data augmentation, multilingual benchmarks, and cross-disciplinary collaboration to improve alignment in low-resource languages.

Abstract

In response to the recent safety probing of OpenAI's GPT-OSS-20B model, we present a summary of a set of vulnerabilities uncovered in the model, focusing on its performance and safety alignment in a low-resource language setting. The core motivation for our work is to question the model's reliability for users from underrepresented communities. Using Hausa, a major African language, we uncover biases, inaccuracies, and cultural insensitivities in the model's behaviour. With minimal prompting, our red-teaming efforts reveal that the model can be induced to generate harmful, culturally insensitive, and factually inaccurate content in the language. As a form of reward hacking, we note how the model's safety protocols appear to relax when prompted with polite or grateful language, leading to outputs that could facilitate misinformation and amplify hate speech. For instance, the model operates on the false assumption that a common insecticide locally known as Fiya-Fiya (Cypermethrin) and a rodenticide known as Shinkafar Bera (a form of Aluminium Phosphide) are safe for human consumption. To contextualise the severity of this error and the popularity of the substances, we conducted a survey (n=61) in which 98% of participants identified them as toxic. Additional failures include an inability to distinguish between raw and processed foods and the incorporation of demeaning cultural proverbs to build inaccurate arguments. We surmise that these issues manifest through a form of linguistic reward hacking, where the model prioritises fluent, plausible-sounding output in the target language over safety and truthfulness. We attribute the uncovered flaws primarily to insufficient safety tuning in low-resource linguistic contexts. By concentrating on a low-resource setting, our approach highlights a significant gap in current red-teaming efforts, and we offer some recommendations.

Paper Structure

This paper contains 10 sections, 1 table.