Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search

Robert J. Moss

TL;DR

This work reframes red-teaming of LLMs as a sequential decision problem using Markov decision processes and introduces Kov, a framework that combines a white-box NGCG-TS search with black-box feedback to discover harmful prompt suffixes. It extends token-level adversarial strategies with a naturalistic loss based on log-perplexity to produce more interpretable suffixes, while leveraging Monte Carlo tree search to avoid local optima. Empirically, Kov can jailbreak GPT-3.5 with limited queries but does not jailbreak GPT-4, underscoring increased robustness in newer models; an aligned-MDP variant demonstrates potential to reduce harm. The approach contributes a reproducible, open-source workflow for automated red-teaming that can inform safety improvements and alignment in LLMs, with implications for robustness testing and responsible disclosure in AI safety research.

Abstract

Eliciting harmful behavior from large language models (LLMs) is an important task to ensure the proper alignment and safety of the models. Often when training LLMs, ethical guidelines are followed yet alignment failures may still be uncovered through red teaming adversarial attacks. This work frames the red-teaming problem as a Markov decision process (MDP) and uses Monte Carlo tree search to find harmful behaviors of black-box, closed-source LLMs. We optimize token-level prompt suffixes towards targeted harmful behaviors on white-box LLMs and include a naturalistic loss term, log-perplexity, to generate more natural language attacks for better interpretability. The proposed algorithm, Kov, trains on white-box LLMs to optimize the adversarial attacks and periodically evaluates responses from the black-box LLM to guide the search towards more harmful black-box behaviors. In our preliminary study, results indicate that we can jailbreak black-box models, such as GPT-3.5, in only 10 queries, yet fail on GPT-4, which may indicate that newer models are more robust to token-level attacks. All work to reproduce these results is open sourced (https://github.com/sisl/Kov.jl).
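As a rough illustration of the naturalistic loss term mentioned in the abstract, the sketch below adds a log-perplexity penalty on the suffix to the negative log-likelihood of the target response. The function names, the weight `lam`, and the choice to sum (rather than average) the target term are assumptions made for illustration; they are not taken from the paper or from Kov.jl. The per-token log-probabilities would come from a white-box model and are supplied here as plain lists.

```python
import math

def negative_log_likelihood(token_logprobs):
    """Sum of negative log-probabilities of the target tokens."""
    return -sum(token_logprobs)

def log_perplexity(token_logprobs):
    """Mean negative log-probability per token (the log of perplexity)."""
    return -sum(token_logprobs) / len(token_logprobs)

def naturalistic_loss(target_logprobs, suffix_logprobs, lam=0.1):
    """Hypothetical combined objective: target NLL plus a weighted
    log-perplexity penalty that favors more natural-looking suffixes.
    `lam` is an assumed weighting hyperparameter."""
    return negative_log_likelihood(target_logprobs) + lam * log_perplexity(suffix_logprobs)

# Toy log-probabilities standing in for white-box model outputs.
target_lp = [math.log(0.5), math.log(0.25)]
suffix_lp = [math.log(0.5), math.log(0.5)]
loss = naturalistic_loss(target_lp, suffix_lp, lam=1.0)
```

With `lam=0` the objective reduces to the target negative log-likelihood alone; larger values trade attack strength for lower suffix perplexity.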

Paper Structure

This paper contains 22 sections, 4 equations, 10 figures, 3 tables, and 5 algorithms.

Figures (10)

  • Figure 1: NGCG-TS: MCTS over the white-box MDP, turning NGCG into a multi-step lookahead.
  • Figure 2: Optimization on Vicuna-7b across variants of GCG. NGCG-TS performs the best due to the multi-step lookahead. Note that GCG is omitted from the loss plot because it does not include log-perplexity, so its values would not be comparable (the GCG results in the negative log-likelihood plot are equivalent to the GCG loss).
  • Figure 3: The Kov algorithm for red-teaming tree search as an MDP. (a) New actions may be added to the tree using NGCG-TS as a sub-tree search step, up to a finite number of actions; after that, existing actions are selected for exploration using UCT. (b) The selected adversarial prompt $s_x$ is transitioned forward via a call to the black-box target model to get a response $s_y$. The response is scored to determine 1) whether a jailbreak occurred, and 2) the level of harmfulness/toxicity of the response. Then the white-box surrogate is used to estimate the future values at the new state node $s'$, avoiding expensive rollouts. (c) The $Q$-values are computed and backed up the tree. (d) After $n_\text{iterations}$ of MCTS, the state with the largest score is returned, unlike traditional MCTS (including NGCG-TS), which selects the root-node action based on the $Q$-values.
  • Figure 4: Average OpenAI moderation scores for open-sourced models $\dagger$ and closed-source models $*$.
  • Figure 5: Adversarial example running in the OpenAI ChatGPT 3.5 web interface.
  • ...and 5 more figures
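
The selection, backup, and final-selection steps described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration using the standard UCT rule and an incremental-mean Q update; the function names, the exploration constant `c`, and the data layout are hypothetical and are not drawn from Kov.jl.

```python
import math

def uct_select(stats, c=1.0):
    """Pick an action by the standard UCT score Q + c * sqrt(ln N / n).
    `stats` maps action -> (Q, n); unvisited actions are tried first."""
    total = sum(n for _, n in stats.values())

    def score(item):
        _, (q, n) = item
        if n == 0:
            return math.inf
        return q + c * math.sqrt(math.log(total) / n)

    return max(stats.items(), key=score)[0]

def backup(q, n, ret):
    """Incremental-mean update of a Q-value with a new return `ret`."""
    n += 1
    q += (ret - q) / n
    return q, n

def best_state(scored_states):
    """Return the state with the largest score, mirroring step (d),
    rather than choosing a root action by Q-value."""
    return max(scored_states, key=lambda pair: pair[1])[0]

# Toy statistics: action "b" is under-explored, so UCT prefers it.
stats = {"a": (0.5, 10), "b": (0.4, 1)}
chosen = uct_select(stats, c=1.0)
greedy = uct_select(stats, c=0.0)
q1, n1 = backup(0.0, 0, 1.0)
q2, n2 = backup(q1, n1, 0.0)
top = best_state([("prompt-1", 0.2), ("prompt-2", 0.9)])
```

Setting `c=0` makes selection purely greedy on the Q-values, which shows why the exploration bonus matters for escaping local optima.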