Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks

Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha

TL;DR

This work addresses the challenge of optimizing discrete LLM prompts for jailbreak attacks by introducing Functional Homotopy (FH), which elevates the objective to a continuous parameter space via F(p, x) and solves a sequence of easier problems by warm-starting across intermediate parameter states. The authors prove NP-hardness for model-agnostic LLM input generation and demonstrate that token-gradient methods offer limited benefit due to discreteness, whereas FH smooths the search by traversing a homotopy path from weakly to strongly aligned models. Empirically, FH-based attacks (FH-GR) outperform baselines like GCG and AutoDAN on several open-source models, achieving 20-30% higher success rates and faster convergence on safe models, with LoRA-based fine-tuning used to manage computational overhead. The results reveal a duality between model training and input design, showing that attacks can exploit intermediate model states to transfer and strengthen adversarial suffixes, and offer practical insights for robustness analyses and security tooling in LLMs.

Abstract

Optimization methods are widely employed in deep learning to identify and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed the functional homotopy method, which leverages the functional duality between model training and input generation. By constructing a series of easy-to-hard optimization problems, we iteratively solve these problems using principles derived from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a $20\%-30\%$ improvement in success rate over existing methods in circumventing established safe open-source models such as Llama-2 and Llama-3.
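
The easy-to-hard scheme described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy loss `make_toy_loss` stands in for F(p, x), the scalar `p` stands in for a chain of model parameter states, and a greedy random-substitution search stands in for the inner discrete optimizer. The only point is the warm start: each stage begins from the solution of the previous, easier stage.

```python
import random
from typing import Callable, List, Sequence, Tuple

Tokens = Tuple[int, ...]
Loss = Callable[[Tokens], float]

# Hypothetical toy setup: none of these names or values come from the paper.
TARGET: Tokens = (3, 1, 4, 1, 0, 2, 3, 4)
VOCAB_SIZE = 5


def make_toy_loss(p: float) -> Loss:
    """Toy stand-in for F(p, .); p in [0, 1] plays the role of alignment strength."""
    n = len(TARGET)
    # A larger p shrinks the region in which the loss carries any signal.
    radius = max(1, round((1.0 - p) * n))

    def loss(x: Tokens) -> float:
        mismatches = sum(a != b for a, b in zip(x, TARGET))
        # Outside the radius the loss is a flat plateau with no guidance.
        return float(mismatches) if mismatches <= radius else float(n)

    return loss


def greedy_random_search(loss: Loss, x0: Tokens, vocab_size: int,
                         iters: int, rng: random.Random) -> Tokens:
    """Propose single-token substitutions; keep those that do not raise the loss."""
    best, best_loss = x0, loss(x0)
    for _ in range(iters):
        pos = rng.randrange(len(best))
        cand = best[:pos] + (rng.randrange(vocab_size),) + best[pos + 1:]
        cand_loss = loss(cand)
        if cand_loss <= best_loss:
            best, best_loss = cand, cand_loss
    return best


def functional_homotopy(make_loss: Callable[[float], Loss],
                        params_easy_to_hard: Sequence[float],
                        x0: Tokens, vocab_size: int,
                        iters_per_stage: int, seed: int = 0
                        ) -> Tuple[Tokens, List[Tuple[float, float, float]]]:
    """Solve the stages in order, warm-starting each from the previous solution."""
    rng = random.Random(seed)
    x = x0
    history: List[Tuple[float, float, float]] = []
    for p in params_easy_to_hard:
        loss = make_loss(p)
        start = loss(x)
        x = greedy_random_search(loss, x, vocab_size, iters_per_stage, rng)
        history.append((p, start, loss(x)))  # (stage, loss before, loss after)
    return x, history


x0: Tokens = (0,) * len(TARGET)
x_final, history = functional_homotopy(
    make_toy_loss, [0.0, 0.5, 0.9], x0, VOCAB_SIZE, iters_per_stage=400
)
for p, start, end in history:
    print(f"p={p:.1f}: loss {start:.0f} -> {end:.0f}")
```

In this toy, attacking the hardest stage directly from `x0` would leave the search wandering on a flat plateau, whereas the easy stage provides signal everywhere and hands a nearby starting point to the harder stages. Whether the same effect holds for a real model depends on how the intermediate parameter states are constructed.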

Paper Structure

This paper contains 45 sections, 4 theorems, 14 equations, 9 figures, 3 tables, 2 algorithms.

Key Result

Theorem 3.1

The model-agnostic LLM input generation optimization problem is $\mathsf{NP}$-hard.

Figures (9)

  • Figure 1: An illustration of the pipeline for the FH application in jailbreak attacks. Initially, a base model is misaligned to produce a sequence of progressively more weakly aligned parameter states. The subsequent attack targets this chain in reverse order, framed as a series of easy-to-hard problems. In this example, the attack begins with twenty "!" characters; modified tokens are highlighted in red to indicate updates from the initial state, showing how the jailbreak suffix evolves along the reversed chain.
  • Figure 2: An example of homotopy from $g(x)$ to $f(x)$. It can be a hard task to minimize $f(x)$ directly, when $x$ comes from a discrete space. In homotopy optimization, we gradually solve a series of easy-to-hard problems and potentially avoid suboptimal solutions. Pink balls are the optimal solution to each problem. The path marked by the arrows illustrates the homotopy path over time.
  • Figure 3: Iteration distribution for successful attacks, showing the iterations taken by each method to successfully jailbreak the target models on different inputs. Each bar here represents about 50 iterations. Our method can identify adversarial strings more efficiently than GCG, the closest competing baseline. Although the plots display iteration counts, it is important to note that each GCG iteration requires more time than an iteration of FH-GR.
  • Figure 4: Transferability of successful attacks on the base model to its finetuned parameter states. We find that the attack does not necessarily transfer for all models. This seems to be a function of the "distance" between the states and the alignment training received.
  • Figure 5: Iteration distribution for successful attacks. We are able to find adversarial strings far more efficiently than GCG, the closest competing baseline.
  • ...and 4 more figures
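
For reference, the homotopy from $g(x)$ to $f(x)$ sketched in Figure 2 is classically written as a one-parameter family that interpolates between the easy and the hard objective. The convex-combination form below is the textbook construction and is an assumption here, not the paper's definition; the symbols $H$, $t_k$ and $x_k$ are introduced only for this illustration. According to the summary above, functional homotopy varies the model parameters in $F(p, x)$ rather than mixing two objectives directly.

```latex
\[
  H(x, t) = (1 - t)\, g(x) + t\, f(x), \qquad t \in [0, 1],
  \qquad H(x, 0) = g(x), \quad H(x, 1) = f(x).
\]
\[
  x_k \approx \arg\min_{x} H(x, t_k) \ \text{ (search started from } x_{k-1}\text{)},
  \qquad 0 = t_0 < t_1 < \dots < t_K = 1.
\]
```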

Theorems & Definitions (6)

  • Theorem 3.1
  • Proposition 3.2
  • Proposition A.1
  • proof
  • Corollary A.2
  • proof