Improving Large Language Model Safety with Contrastive Representation Learning
Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
TL;DR
This work addresses the vulnerability of large language models to jailbreaks and adversarial prompts with a defense based on contrastive representation learning. It formulates safety in a learned representation space and optimizes a triplet-based objective with adversarial hard negative mining, which pulls benign representations together and pushes them away from harmful ones while preserving benign behavior and maintaining KL alignment on safe prompts. Empirically, the triplet defense outperforms circuit breakers and RepBend against both input-space and embedding-space attacks. On Llama 3 8B, the embedding-space attack success rate (ASR) falls to as low as 0% (4.88% with adversarial hard negative mining), and general-language performance on standard benchmarks is retained. The approach generalizes to out-of-distribution formats (measured by MMDR) and remains effective across multiple models, though it incurs additional compute cost and has some model-specific limitations. These results suggest practical value for deploying safer LLMs in diverse settings.
Abstract
Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense
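To make the triplet objective concrete, the following is a minimal sketch of a generic hinge-style triplet loss together with a simple hardest-negative selection step. It is not the paper's exact objective or mining procedure: the Euclidean distance, the margin value, the function names, and the toy 2-D vectors are all assumptions chosen for illustration, and the "mining" here only picks the closest candidate from a fixed pool rather than generating adversarial examples.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss (assumed form, not the paper's exact loss).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def hardest_negative(anchor, candidates):
    """Return the candidate closest to the anchor.

    This is a simplified stand-in for hard negative mining: the closest
    negative is the one the current representation separates least well.
    """
    return min(candidates, key=lambda c: euclidean(anchor, c))


# Toy 2-D "representations" (hypothetical values, for illustration only).
benign_anchor = [0.0, 0.0]
benign_positive = [0.0, 1.0]
harmful_pool = [[3.0, 4.0], [0.0, 1.5]]

hard = hardest_negative(benign_anchor, harmful_pool)
loss_hard = triplet_loss(benign_anchor, benign_positive, hard)
loss_easy = triplet_loss(benign_anchor, benign_positive, harmful_pool[0])

print(hard, loss_hard, loss_easy)
```

In this toy setup the distant harmful vector already satisfies the margin and contributes no loss, while the nearby one still does. This is the intuition behind training on hard negatives: they are the examples that still carry a gradient signal.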
