Fluent dreaming for language models

T. Ben Thompson, Zygimantas Straznickas, Michael Sklar

TL;DR

Fluent dreaming for language models introduces Evolutionary Prompt Optimization (EPO), a method that traces out a Pareto frontier between the activation of a differentiable internal feature $f(\mathbf{t})$ and the fluency of the prompt $\mathbf{t}$, enabling fluent dreaming in text models. EPO maintains a population of prompts, each optimized under a different fluency weight, and updates them with gradient-informed token substitutions. Applied to token logits, MLP neurons, and residual-stream directions in Pythia-12B, it finds prompts that activate the target feature more strongly than max-activating dataset examples while remaining fluent. The resulting prompts expose rare or mildly out-of-distribution triggers and polysemantic behavior, which supports mechanistic interpretability work and potential red-teaming applications. The approach applies to any differentiable internal feature and comes with public code, offering a practical tool for probing language-model internals beyond dataset-driven analyses.

Abstract

Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize a neuron's activation or other internal component. However, dreaming has not been successfully applied to language models because the input space is discrete. We extend Greedy Coordinate Gradient, a method from the language model adversarial attack literature, to design the Evolutionary Prompt Optimization (EPO) algorithm. EPO optimizes the input prompt to simultaneously maximize the Pareto frontier between a chosen internal feature and prompt fluency, enabling fluent dreaming for language models. We demonstrate dreaming with neurons, output logits and arbitrary directions in activation space. We measure the fluency of the resulting prompts and compare language model dreaming with max-activating dataset examples. Critically, fluent dreaming allows automatically exploring the behavior of model internals in reaction to mildly out-of-distribution prompts. Code for running EPO is available at https://github.com/Confirm-Solutions/dreamy. A companion page demonstrating code usage is at https://confirmlabs.org/posts/dreamy.html
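
The trade-off between feature activation and fluency can be illustrated with a small sketch. It assumes, as an illustration that this section does not spell out, that each population member is scored by a linear trade-off `activation - lam * cross_entropy`, with lower cross-entropy standing in for higher fluency. The numbers and the helper names `pareto_frontier` and `select_per_lambda` are hypothetical and are not taken from the dreamy codebase.

```python
import numpy as np

def pareto_frontier(cross_entropy, activation):
    """Return indices of candidates that no other candidate beats on both
    axes, i.e. lower cross-entropy AND higher activation."""
    # Sort by cross-entropy ascending; break ties by activation descending.
    order = np.lexsort((-activation, cross_entropy))
    frontier, best_activation = [], -np.inf
    for i in order:
        # A candidate is kept only if it improves on every more fluent one.
        if activation[i] > best_activation:
            frontier.append(int(i))
            best_activation = activation[i]
    return frontier

def select_per_lambda(cross_entropy, activation, lambdas):
    """For each fluency weight, pick the candidate that maximizes the
    assumed objective: activation - lam * cross_entropy."""
    scores = activation[None, :] - lambdas[:, None] * cross_entropy[None, :]
    return scores.argmax(axis=1)

# Hypothetical candidate prompts: (cross-entropy, feature activation).
xe = np.array([2.0, 3.0, 4.0, 5.0, 3.5])
act = np.array([1.0, 2.5, 2.0, 4.0, 3.2])

frontier = pareto_frontier(xe, act)                              # [0, 1, 4, 3]
winners = select_per_lambda(xe, act, np.array([0.0, 1.0, 10.0]))  # [3, 4, 0]
print(frontier, winners.tolist())
```

In this toy setting, a weight of zero picks the highest-activation candidate regardless of fluency, a large weight picks the most fluent one, and every per-weight winner lies on the frontier. That is the motivation for keeping one population member per fluency weight.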

Paper Structure (12 sections, 4 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Token-level attribution for three of the examples from Table \ref{tab:six-examples}. For each token position, we use the same top-k gradient operation used in GCG and EPO to identify 32 candidate replacement tokens. The color of the token in the visualization corresponds to the reduction in activation from the worst substituted token in that position. Dark reds indicate that a different token in that position can reduce the activation to almost zero. The height of the token bar indicates the reduction in activation from the best alternative token in that position. Tall bars indicate that all potential token substitutions reduce activation dramatically and thus the precise token is very important. We share interactive versions of this visualization at https://confirmlabs.org/posts/dreamy.html.
  • Figure 2: a) Pareto frontiers for sixty runs of EPO applied to L10.N5 with different random initializations. b) The evolution of the Pareto frontier during a single run of EPO.
  • Figure 3: Comparing activation and cross-entropy between dreaming outputs and the top 64 max-activating dataset examples from 500 million tokens of the Pile. The black line schematically separates the regions of the plot that are empirically inside and outside the training distribution.
  • Figure 4: Causal attribution visualizations for six prompts produced by applying dreaming to L3.N1000. See the caption of Figure 1 for an explanation of the attribution method.
  • Figure 5: a) Average cross-entropy plotted against number of tokens for text in the Pile. The dark gray region shows one standard deviation above and below average cross-entropy while the light gray region shows two standard deviations. Note that all the optimized prompts in this paper are 12 tokens in length. b) We plot average cross-entropy across the residual vector alignment dreaming runs as a function of slack below maximum alignment as measured in units of random prompt standard deviations. See the text for a discussion of the units. c) The distribution of cross-entropy for dreaming results with a slack of three standard deviations. d) The distribution over 50 random target vectors of maximum alignment for each method. e) Similar to d), but in order to approximately restrict the analysis to in-distribution prompts, we restrict the dreaming outputs to have cross-entropy below 6. Note that we do not restrict the cross-entropy of the max-activating dataset prompts because they are known to be in-distribution.
  • ...and 1 more figures
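
The two per-token quantities described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions. The candidate replacements are passed in directly, whereas the paper obtains 32 of them per position from a top-k gradient operation. The `activation` callable is a hypothetical toy stand-in for a model feature. The names `token_attribution`, `color`, and `height` are illustrative and are not taken from the dreamy codebase.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def token_attribution(
    tokens: Sequence[str],
    candidates: Sequence[Sequence[str]],
    activation: Callable[[Sequence[str]], float],
) -> Tuple[float, List[Dict[str, object]]]:
    """For each position, substitute every candidate token and record:
    color  = activation drop under the worst (most damaging) substitute,
    height = activation drop under the best (least damaging) alternative.
    Assumes each position has at least one candidate differing from the
    original token."""
    base = activation(tokens)
    rows = []
    for pos, subs in enumerate(candidates):
        acts = []
        for sub in subs:
            if sub == tokens[pos]:
                continue  # skip the original token itself
            trial = list(tokens)
            trial[pos] = sub
            acts.append(activation(trial))
        rows.append({
            "token": tokens[pos],
            "color": base - min(acts),
            "height": base - max(acts),
        })
    return base, rows

# Hypothetical toy feature: a sum of per-token weights (not a real model).
WEIGHTS = {"the": 1, "a": 1, "an": 0, "cat": 8, "dog": 1, "car": 3,
           "sat": 2, "ran": 2, "stood": 1}

def toy_activation(toks: Sequence[str]) -> float:
    return float(sum(WEIGHTS[t] for t in toks))

base, rows = token_attribution(
    ["the", "cat", "sat"],
    [["a", "an"], ["dog", "car"], ["ran", "stood"]],
    toy_activation,
)
for row in rows:
    print(row)
```

In this toy example only the middle token has a large `height`, meaning every alternative lowers the activation substantially. This corresponds to the tall bars that the caption describes as marking positions where the precise token matters.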