Merging Improves Self-Critique Against Jailbreak Attacks

Victor Gallego

TL;DR

The paper addresses jailbreak attacks on LLMs by proposing a defense framework that strengthens self-critique and robustness through external critics and model merging, all trained on sanitized synthetic data without human labeling. It extends self-critique with RR templates, an external critic (RR-extcrit), and a merged model (RR-merge), complemented by self-distillation via Direct Preference Optimization. Empirical results on open-source LLMs show substantial reductions in attack success rate while preserving general capabilities, and contamination analyses support the integrity of the data used. The work offers a practical defense pathway for safer LLM deployment and suggests avenues for further improvement in merging strategies and automated jailbreak generation.

Abstract

The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's response to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can significantly reduce the attack success rate of adversaries, thus offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks.
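
To make the idea of merging a critic model with the original more concrete, here is a minimal sketch of one common merging scheme, linear weight interpolation. The abstract does not say which merging method is used, so the scheme itself, the function name `merge_linear`, the mixing weight `alpha`, and the toy weight dictionaries are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
# Real checkpoints hold tensors; plain lists keep the sketch dependency-free.
Weights = Dict[str, List[float]]


def merge_linear(base: Weights, critic: Weights, alpha: float = 0.5) -> Weights:
    """Interpolate two checkpoints with identical architecture.

    Each merged weight is (1 - alpha) * base + alpha * critic, so
    alpha = 0 returns the base model and alpha = 1 returns the critic.
    (Hypothetical helper; the merging scheme is an assumption.)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != critic.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        critic_values = critic[name]
        if len(base_values) != len(critic_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * c
            for b, c in zip(base_values, critic_values)
        ]
    return merged


if __name__ == "__main__":
    base_model = {"layer.weight": [0.0, 2.0]}
    critic_model = {"layer.weight": [4.0, 2.0]}
    print(merge_linear(base_model, critic_model, alpha=0.5))
```

Interpolating in weight space only makes sense when both models share the same architecture and parameter names, which is why the sketch checks for that before merging.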

Paper Structure (18 sections, 4 equations, 1 figure, 7 tables)