Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha
TL;DR
This work addresses the challenge of optimizing discrete LLM prompts for jailbreak attacks by introducing Functional Homotopy (FH), which lifts the objective to a function F(p, x) of continuous model parameters p and a discrete input x, then solves a sequence of easier problems by warm-starting the input search across intermediate parameter states. The authors prove that model-agnostic LLM input generation is NP-hard and show that token-gradient methods offer limited benefit because the input space is discrete, whereas FH smooths the search by traversing a homotopy path from weakly aligned to strongly aligned models. Empirically, FH-based attacks (FH-GR) outperform baselines such as GCG and AutoDAN on several open-source models, achieving 20-30% higher success rates and faster convergence on safe models, with LoRA-based fine-tuning used to manage computational overhead. The results reveal a duality between model training and input design: attacks can exploit intermediate model states to transfer and strengthen adversarial suffixes. They also offer practical insights for robustness analysis and security tooling for LLMs.
Abstract
Optimization methods are widely employed in deep learning to identify and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed the \emph{functional homotopy} method, which leverages the functional duality between model training and input generation. We construct a series of easy-to-hard optimization problems and solve them iteratively using principles derived from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a $20\%$--$30\%$ improvement in success rate over existing methods in circumventing established safe open-source models such as Llama-2 and Llama-3.
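The easy-to-hard continuation described above can be illustrated with a minimal sketch on a toy objective rather than an LLM. Everything named here is an assumption made for illustration: `functional_homotopy`, `greedy_random_search`, `toy_objective`, the interpolated loss, and the choice of plain random search as the inner solver are hypothetical stand-ins, not the authors' implementation. In the actual method, the parameter p would range over intermediate model states rather than a scalar in [0, 1].

```python
import random
from typing import Callable, List, Sequence, Tuple

Tokens = Tuple[int, ...]


def greedy_random_search(
    loss: Callable[[Tokens], float],
    x: Tokens,
    vocab_size: int,
    steps: int,
    rng: random.Random,
) -> Tuple[Tokens, float]:
    """Propose single-token substitutions and keep any that do not increase the loss."""
    best, best_loss = x, loss(x)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        cand = best[:pos] + (rng.randrange(vocab_size),) + best[pos + 1:]
        cand_loss = loss(cand)
        if cand_loss <= best_loss:
            best, best_loss = cand, cand_loss
    return best, best_loss


def functional_homotopy(
    F: Callable[[float, Tokens], float],
    levels: Sequence[float],
    x0: Tokens,
    vocab_size: int,
    steps_per_level: int,
    rng: random.Random,
) -> Tuple[Tokens, List[float]]:
    """Solve F(p, .) for each p in `levels` (easy to hard), warm-starting each
    level from the solution found at the previous one."""
    x, trace = x0, []
    for p in levels:
        # Bind p as a default argument so each level optimizes its own objective.
        x, value = greedy_random_search(
            lambda z, p=p: F(p, z), x, vocab_size, steps_per_level, rng
        )
        trace.append(value)
    return x, trace


# Toy problem (an assumption, not from the paper): recover a hidden token tuple.
TARGET: Tokens = (3, 1, 4, 1)
VOCAB_SIZE = 8


def toy_objective(p: float, x: Tokens) -> float:
    """p = 0 gives an informative loss; p = 1 gives a flat loss with a single zero."""
    smooth = sum(abs(a - b) for a, b in zip(x, TARGET)) / VOCAB_SIZE
    hard = 0.0 if x == TARGET else 1.0
    return (1.0 - p) * smooth + p * hard
```

Calling `functional_homotopy(toy_objective, [0.0, 0.5, 1.0], (0, 0, 0, 0), VOCAB_SIZE, 2000, random.Random(0))` walks the three levels in order. In this toy, the objective at p = 1 is flat everywhere except at the target, so a search started from scratch at that level receives no signal; the solution carried over from the easier levels is what makes the final problem tractable.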
