Scaling Down Semantic Leakage: Investigating Associative Bias in Smaller Language Models
Veronika Smilga
TL;DR
The paper investigates semantic leakage across language-model sizes by extending the framework of Gonen et al. (2024) to the Qwen2.5-Instruct family and introducing a color-focused prompt dataset. It uses a standardized Mean Leak-Rate metric, computed with BERTScore and SentenceBERT, to compare test and control generations across prompts. Leakage persists across all models and datasets (>50%): larger models often leak more, though the trend is nonlinear, and the smallest model leaks the least. Repetition-driven leakage also emerges in certain mid-sized configurations. The work releases its data and code publicly, highlights category-dependent leakage patterns, and outlines limitations and directions for mitigation and further study of leakage dynamics.
Abstract
Semantic leakage is a phenomenon recently introduced by Gonen et al. (2024). It refers to situations in which associations learned from the training data surface in language model generations in unexpected and sometimes undesired ways. Prior work has focused on leakage in large language models (7B+ parameters). In this study, I use the Qwen2.5 model family to explore whether smaller models, ranging from 500M to 7B parameters, demonstrate less semantic leakage due to their limited capacity for capturing complex associations. Building on the dataset from Gonen et al. (2024), I introduce a new dataset of color-focused prompts, categorized by type of semantic association, to systematically evaluate the models' behavior. Results indicate that smaller models exhibit less semantic leakage overall, although this trend is not strictly linear: medium-sized models sometimes leak more than larger ones. The dataset, the model generations, and the evaluation code are publicly available at https://github.com/smilni/semantic_leakage_project.
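To make the test-versus-control comparison behind the Mean Leak-Rate concrete, here is a minimal sketch. It is not the evaluation code from the repository. It assumes the scoring rule I understand from Gonen et al. (2024): a prompt counts as a leak (1) when the concept is more similar to the test generation than to the control generation, as a tie (0.5) when the two similarities are equal, and as no leak (0) otherwise, with the result averaged and reported as a percentage. The names `mean_leak_rate` and `jaccard_similarity` are hypothetical, and the token-overlap similarity is only a toy stand-in for BERTScore or SentenceBERT.

```python
from typing import Callable, Sequence


def jaccard_similarity(a: str, b: str) -> float:
    """Toy token-overlap similarity; a stand-in for BERTScore or SentenceBERT."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_leak_rate(
    concepts: Sequence[str],
    control_generations: Sequence[str],
    test_generations: Sequence[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
) -> float:
    """Return the mean leak rate in percent (assumed scoring rule: 1 / 0.5 / 0)."""
    if not (len(concepts) == len(control_generations) == len(test_generations)):
        raise ValueError("All inputs must have the same length.")
    if not concepts:
        raise ValueError("At least one prompt is required.")

    total = 0.0
    for concept, control, test in zip(concepts, control_generations, test_generations):
        sim_test = similarity(concept, test)
        sim_control = similarity(concept, control)
        if sim_test > sim_control:
            total += 1.0  # concept is closer to the test generation: leak
        elif sim_test == sim_control:
            total += 0.5  # tie: counted as half a leak
        # otherwise the control generation is closer: no leak
    return 100.0 * total / len(concepts)


if __name__ == "__main__":
    # Made-up illustrative strings, not data from the paper.
    concepts = ["yellow", "red"]
    control = ["a doctor", "a red car"]
    test = ["a yellow bus driver", "a teacher"]
    print(mean_leak_rate(concepts, control, test))
```

Under this scoring rule, a value of 50% corresponds to test generations being no closer to the concept than control generations on average, so values above 50% indicate leakage. Any function that maps two strings to a score can be passed as `similarity`, which is where an embedding-based measure would be substituted.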
