Eliciting Language Model Behaviors with Investigator Agents

Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt

TL;DR

The paper addresses the challenge of characterizing and eliciting diverse target behaviors from language models by training investigator agents that map goals to prompts. It introduces a joint string- and rubric-based elicitation framework that uses supervised fine-tuning, direct preference optimization, and Frank-Wolfe diversification to produce multiple interpretable prompting strategies. The approach yields high attack success rates on automated jailbreaking, hallucination, and aberrant-behavior tasks, often outperforming strong baselines and transferring across models. This scalable, amortized elicitation mechanism enhances red-teaming and safety evaluation by uncovering a broader set of effective prompts while maintaining prompt fluency and interpretability.

Abstract

Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.

Paper Structure

This paper contains 46 sections, 12 equations, 7 figures, 12 tables, 5 algorithms.

Figures (7)

  • Figure 1: To surface specific behaviors from the target language model (e.g. an SQL injection error), we train an investigator model to search for prompts that elicit responses satisfying example-specific criteria, including exact string matches and natural-language rubrics.
  • Figure 2: Training pipeline for our investigator model: 1. We first collect (prompt, response) pairs by generating responses from the target model. 2. We perform SFT to predict the prompt from the response. 3. We refine this investigator using DPO to further improve the elicitation probabilities. 4. We apply the Frank-Wolfe algorithm to discover new strategies that were not revealed by previous iterations.
  • Figure 3: We cast rubric-based elicitation as a two-stage problem: the first stage searches for the target response $y$, and the second stage searches for the prompt $x$ that elicits the target response. Both stages can be solved using methods for string elicitation.
  • Figure 4: Elicitation log-probability and fluency first drop at the second iteration, then gradually increase across the remaining iterations. Diversity initially increases and then slightly decreases over the remaining iterations.
  • Figure 5: Performance of WildChat and UltraChat across DPO iterations for AdvBench (Harmful Strings). Elicitation log-probability initially improves but can decrease over the course of training.
  • ...and 2 more figures
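
As a rough illustration of the first two steps in the Figure 2 caption (collecting (prompt, response) pairs from the target model, then training the investigator to predict the prompt from the response), the data preparation might look like the sketch below. This is not the authors' code: `toy_target_model`, `collect_pairs`, `build_investigator_sft_data`, and `SFTExample` are hypothetical names introduced only for illustration, and the toy target model simply upper-cases its input in place of a real language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SFTExample:
    source: str  # what the investigator conditions on (the target's response)
    label: str   # what the investigator learns to produce (the prompt)


def collect_pairs(
    prompts: List[str], target_model: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Step 1: query the target model and keep (prompt, response) pairs."""
    return [(prompt, target_model(prompt)) for prompt in prompts]


def build_investigator_sft_data(pairs: List[Tuple[str, str]]) -> List[SFTExample]:
    """Step 2 (data side): reverse each pair so the response is the input
    and the prompt is the label, i.e. 'predict the prompt from the response'."""
    return [SFTExample(source=response, label=prompt) for prompt, response in pairs]


def toy_target_model(prompt: str) -> str:
    # Hypothetical stand-in for a real target language model (assumption).
    return prompt.upper()


pairs = collect_pairs(["hello", "what is 2+2?"], toy_target_model)
sft_data = build_investigator_sft_data(pairs)
```

The reversed examples would then be passed to whatever fine-tuning setup is in use; the DPO and Frank-Wolfe stages from the caption are not sketched here because this page does not state their objectives.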