Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Hamin Koo, Minseon Kim, Jaehyung Kim

TL;DR

AMIS introduces a bi-level meta-optimization framework that co-evolves jailbreak prompts and scoring templates, producing stronger attacks on LLMs and more calibrated scoring signals. The inner loop refines prompts using dense, continuous feedback on a $1.0$–$10.0$ scale, while the outer loop optimizes the scoring rubric itself by maximizing its alignment with true binary attack outcomes, measured by an ASR alignment score $\text{Align}(\pi_{sc})$. Empirical results on AdvBench and JBB-Behaviors show state-of-the-art performance, with AMIS achieving high ASR (e.g., 88.0% on Claude-3.5-Haiku and 100.0% on Claude-4-Sonnet) and improved StR across five target LLMs, outperforming six baselines. Ablations confirm the necessity of dataset-level scoring evolution, dense templates, and cross-query signals, and analyses reveal nuanced transferability of prompts across models. Together, these findings underscore the importance of jointly optimizing evaluation signals and attack strategies to advance robust LLM safety research, while acknowledging limitations such as judge bias and computation cost.

Abstract

Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty in the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined using fine-grained, dense feedback from a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.
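
To make the outer-loop signal concrete, the following is a minimal sketch of how agreement between a judge's dense scores and binary attack outcomes could be measured. The formula is an assumption for illustration, not the paper's definition of the ASR alignment score: each score on the 1.0 to 10.0 scale is rescaled to [0, 1] and credited when it is high for a success and low for a failure. The names `alignment_score` and `select_best_template` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def alignment_score(
    scores: Sequence[float],
    successes: Sequence[bool],
    lo: float = 1.0,
    hi: float = 10.0,
) -> float:
    """Mean agreement between dense judge scores and binary outcomes.

    Assumed formula (not taken from the paper): rescale each score to
    [0, 1]; a success contributes p, a failure contributes 1 - p.
    """
    if len(scores) != len(successes):
        raise ValueError("scores and successes must have equal length")
    if len(scores) == 0:
        raise ValueError("at least one logged pair is required")
    total = 0.0
    for score, success in zip(scores, successes):
        if not lo <= score <= hi:
            raise ValueError(f"score {score} is outside [{lo}, {hi}]")
        p = (score - lo) / (hi - lo)  # normalized score in [0, 1]
        total += p if success else 1.0 - p
    return total / len(scores)


def select_best_template(
    logs: Dict[str, Tuple[Sequence[float], Sequence[bool]]],
) -> str:
    """Return the name of the template whose logged scores agree best
    with the binary outcomes (a stand-in for one outer-loop selection)."""
    if not logs:
        raise ValueError("no templates to compare")
    return max(logs, key=lambda name: alignment_score(*logs[name]))
```

Under this assumed measure, a template that assigns the maximum score to every success and the minimum score to every failure reaches 1.0, and a template whose scores are inverted reaches 0.0. Scoring across many queries at once is what makes the comparison a dataset-level one.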

Paper Structure

This paper contains 48 sections, 9 equations, 15 figures, 13 tables, 1 algorithm.

Figures (15)

  • Figure 1: Motivation. (a) Illustration of an LLM-based jailbreak framework, where an attacker LLM iteratively refines prompts based on a judge LLM’s evaluation of the target LLM’s responses. (b) Changing only the scoring template of the judge LLM yields significantly different results, highlighting the importance of how to evaluate jailbreak prompts. (c) Performance comparison on state-of-the-art LLMs, including recent Claude models. AMIS significantly outperforms baseline approaches, demonstrating its effectiveness.
  • Figure 2: Overview of AMIS. (a) Inner loop: query-level prompt optimization, where the attacker iteratively generates jailbreak prompts guided by a fixed scoring template. (b) Outer loop: dataset-level scoring template optimization, where the scoring template is updated based on ASR alignment score with ground-truth attack success labels, using the logged prompt–score pairs from inner loop across multiple queries.
  • Figure 3: Initial vs. optimized scoring templates. The full versions of both templates are provided in Appendix \ref{sec:appendix_init_sc_template} and Appendix \ref{sec:appendix_additional_examples}.
  • Figure 4: Prompt transferability across models. ASR on target models (columns) when prompts optimized on source models (rows) are applied.
  • Figure 5: Transferability heatmap across six models. Each cell indicates the transfer attack success rate when jailbreak prompts optimized on the source model (rows) are applied to the target model (columns).
  • ...and 10 more figures