Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models

Hila Gonen, Terra Blevins, Alisa Liu, Luke Zettlemoyer, Noah A. Smith

TL;DR

This work introduces semantic leakage, a phenomenon where prompt semantics unduly influence language-model generations. It defines a Leak-Rate metric, builds a 109-prompt test suite, and evaluates 13 flagship models (GPT and Llama families) across multiple temperatures and multilingual settings, including open-ended tasks. The results show robust leakage across models and tasks, with instruction-tuned variants and multilingual/crosslingual prompts exposing stronger leakage. The findings highlight the pervasiveness of learned associations in generation, underline implications for prompt design and safety, and motivate future work on mitigation and deeper analysis of semantic coupling in language models.

Abstract

Despite their wide adoption, the biases and unintended behaviors of language models remain poorly understood. In this paper, we identify and characterize a phenomenon never discussed before, which we call semantic leakage, where models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and automatically, curate a diverse test suite for diagnosing this behavior, and measure significant semantic leakage in 13 flagship models. We also show that models exhibit semantic leakage in languages besides English and across different settings and generation scenarios. This discovery highlights yet another type of bias in language models that affects their generation patterns and behavior.
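The abstract mentions automatic detection without spelling out the metric. One plausible reading of Leak-Rate, offered here as an assumption rather than the authors' exact definition, is: for each instance, compare how similar the leaking concept is to the generation from a test prompt (with the concept) versus a control prompt (without it), score a win as 1, a tie as 0.5, and a loss as 0, then average as a percentage. The sketch below uses a toy token-overlap similarity as a stand-in for an embedding-based scorer such as BERT-score; the function names and the example strings are hypothetical and are not taken from the paper's test suite.

```python
from typing import Callable, Iterable, Tuple

Instance = Tuple[str, str, str]  # (concept, control_generation, test_generation)


def token_overlap(a: str, b: str) -> float:
    """Toy stand-in similarity: Jaccard overlap of lowercase tokens.

    Assumption: a real evaluation would plug in an embedding-based scorer here.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def leak_rate(
    instances: Iterable[Instance],
    sim: Callable[[str, str], float] = token_overlap,
) -> float:
    """Percentage of instances where the concept is closer to the test generation.

    Assumed scoring: 1 if sim(test) > sim(control), 0.5 on a tie, 0 otherwise.
    Under this scoring, 50 corresponds to no measurable leakage.
    """
    scores = []
    for concept, control_gen, test_gen in instances:
        s_control = sim(concept, control_gen)
        s_test = sim(concept, test_gen)
        if s_test > s_control:
            scores.append(1.0)
        elif s_test == s_control:
            scores.append(0.5)
        else:
            scores.append(0.0)
    if not scores:
        raise ValueError("need at least one instance")
    return 100.0 * sum(scores) / len(scores)


# Made-up generations for illustration only.
examples = [
    ("yellow", "a doctor", "a yellow bus driver"),  # concept leaks into the test generation
    ("red", "a teacher", "a teacher"),              # tie: neither generation mentions the concept
]
print(leak_rate(examples))
```

Keeping the similarity function as a parameter lets the same aggregation be reused with different scorers, which matters if, as the figure captions suggest, results are reported per similarity method.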

Paper Structure

This paper contains 37 sections, 2 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Examples of semantic leakage in GPT-4o. The leaking concept is underlined.
  • Figure 2: Semantic leakage in Llama models, averaged across temperature values (measured with Leak-Rate using BERT-score).
  • Figure 3: Semantic leakage in Llama at different temperatures (measured with Leak-Rate using BERT-score).
  • Figure 4: Human detection of semantic leakage compared to automatic methods. Leak-Rate is reported on the right for each method.
  • Figure 5: Human detection of semantic leakage in multilingual and crosslingual settings.
  • ...and 5 more figures