Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities

Ross Nordby

TL;DR

This work presents a scalable evaluation framework that quantifies how readily a language model can exhibit a target capability by optimizing input embeddings, called soft prompts. It defines conditional distance $C(M,D,T)$ and saturation $S(M,D)$ to measure the information density required to elicit behaviors across tasks, and implements a simple input model that maps task conditions to soft prompt embeddings (conditional soft prompts). The framework is demonstrated on autoregressive language tasks, a repetition task, token-skipping, detuning, pathfinding, and chess prediction, showing variable distances to saturation and partial reversion of fine-tuning in some cases. The approach offers a quantitative tool for safety and robustness assessment that can scale with future models and aid in red-teaming and policy discussions, albeit with limitations related to hyperparameter sensitivity and compute constraints.

Abstract

To help evaluate and understand the latent capabilities of language models, this paper introduces an approach using optimized input embeddings, or 'soft prompts,' as a metric of conditional distance between a model and a target behavior. The technique aims to facilitate latent capability discovery as a part of automated red teaming/evaluation suites and to provide quantitative feedback about the accessibility of potentially concerning behaviors in a way that may scale to powerful future models, including those which may otherwise be capable of deceptive alignment. An evaluation framework using soft prompts is demonstrated in natural language, chess, and pathfinding, and the technique is extended with generalized conditional soft prompts to aid in constructing task evaluations.

Paper Structure

This paper contains 32 sections, 2 equations, 22 figures, 10 tables.

Figures (22)

  • Figure 1: Unconditional soft prompt embeddings are provided to the model directly, bypassing input token ids and the usual embedding process. The parameters composing the soft prompt are optimized directly. Here, the soft prompt's embeddings are inserted in the middle of the sequence between tokens 1 and 2, but in principle they can appear anywhere.
  • Figure 2: Conditional soft prompt embeddings are created from an input model that maps task-relevant conditions to soft prompt embeddings. The input model's parameters are optimized.
  • Figure 3: Sequence 1 prompt: a benign system message with a benign user message.
  • Figure 4: Raw chat model completion of sequence 1.
  • Figure 5: Soft-prompted chat model completion of sequence 1.
  • ...and 17 more figures
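
The mechanisms described in the captions for Figures 1 and 2 can be illustrated with a minimal sketch. It is not the paper's implementation: the embedding table, the dimensions, the insertion position, and the linear `input_model` are all hypothetical stand-ins chosen for illustration, and no optimization loop or language model is included. It only shows how soft prompt embeddings are spliced into an already-embedded token sequence, and how a conditional variant derives those embeddings from a task condition instead of storing them directly.

```python
import random

def embed(token_ids, table):
    """Ordinary embedding lookup: the step that soft prompt rows bypass."""
    return [list(table[t]) for t in token_ids]

def splice_soft_prompt(token_embeds, soft_prompt, position):
    """Insert soft prompt rows into the embedded sequence at `position`."""
    return token_embeds[:position] + soft_prompt + token_embeds[position:]

def input_model(condition, weights, bias, n_soft, d_model):
    """Hypothetical linear input model (an assumption, not the paper's):
    maps a condition vector to n_soft embeddings of width d_model."""
    flat = [
        sum(w * c for w, c in zip(row, condition)) + b
        for row, b in zip(weights, bias)
    ]
    return [flat[i * d_model:(i + 1) * d_model] for i in range(n_soft)]

rng = random.Random(0)
vocab_size, d_model, n_soft, d_cond = 10, 4, 2, 3  # illustrative sizes only

# Stand-in for a frozen model's embedding table.
table = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]
token_embeds = embed([5, 7, 1], table)

# Unconditional soft prompt: the embedding rows themselves are the
# trainable parameters (here just randomly initialized).
soft_prompt = [[rng.gauss(0, 0.02) for _ in range(d_model)] for _ in range(n_soft)]
unconditional_inputs = splice_soft_prompt(token_embeds, soft_prompt, position=1)

# Conditional soft prompt: the input model's weights and bias are the
# trainable parameters; the embeddings are a function of the condition.
weights = [[rng.gauss(0, 0.02) for _ in range(d_cond)] for _ in range(n_soft * d_model)]
bias = [0.0] * (n_soft * d_model)
condition = [1.0, 0.0, 0.5]  # hypothetical task-condition vector
conditional_prompt = input_model(condition, weights, bias, n_soft, d_model)
conditional_inputs = splice_soft_prompt(token_embeds, conditional_prompt, position=1)

print(len(unconditional_inputs), len(conditional_inputs))
```

In both variants the model would receive a sequence of `3 + n_soft` embedding rows. The difference lies in what an optimizer would update: the soft prompt rows directly in the unconditional case, or the input model's parameters in the conditional case.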