Fluent dreaming for language models
T. Ben Thompson, Zygimantas Straznickas, Michael Sklar
TL;DR
This paper introduces Evolutionary Prompt Optimization (EPO), a method that traces the Pareto frontier between a differentiable internal feature $f(\mathbf{t})$ and prompt fluency, enabling fluent dreaming in text models. EPO maintains a population of prompts, each optimized under a different fluency weight, and updates them with gradient-informed token substitutions. Across token logits, MLP neurons, and residual-stream directions in Pythia-12B, it reaches higher activations than max-activating dataset examples while keeping the prompts coherent. The resulting prompts surface rare or out-of-distribution triggers and polysemantic behavior, which supports mechanistic interpretability work and may be useful for red-teaming. The approach is general, extensible, and backed by public code, giving a practical way to probe language-model internals beyond dataset-driven analysis.
Abstract
Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize the activation of a neuron or other internal component. However, dreaming has not been successfully applied to language models because the input space is discrete. We extend Greedy Coordinate Gradient, a method from the language model adversarial attack literature, to design the Evolutionary Prompt Optimization (EPO) algorithm. EPO optimizes the input prompt along the Pareto frontier between a chosen internal feature and prompt fluency, enabling fluent dreaming for language models. We demonstrate dreaming with neurons, output logits, and arbitrary directions in activation space. We measure the fluency of the resulting prompts and compare language model dreaming with max-activating dataset examples. Critically, fluent dreaming allows automatically exploring the behavior of model internals in reaction to mildly out-of-distribution prompts. Code for running EPO is available at https://github.com/Confirm-Solutions/dreamy. A companion page demonstrating code usage is at https://confirmlabs.org/posts/dreamy.html.
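
To make the feature/fluency trade-off concrete, the sketch below scores a handful of candidate prompts and extracts their Pareto frontier. It is an illustration only, not the dreamy implementation: the `Candidate` type, the function names, the penalized score `feature - lam * cross_entropy`, and all numbers are assumptions chosen for the example, and the code does not run a model or perform any token substitution.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one prompt (not a type from the dreamy repo)."""
    prompt: str
    feature: float        # assumed feature activation f(t); higher is better
    cross_entropy: float  # assumed mean self cross-entropy; lower is more fluent


def penalized_score(c: Candidate, lam: float) -> float:
    # Assumed scalarization: reward the feature, penalize low fluency.
    return c.feature - lam * c.cross_entropy


def best_per_weight(cands: List[Candidate], lams: List[float]) -> Dict[float, Candidate]:
    # One population member per fluency weight: the best candidate under that weight.
    return {lam: max(cands, key=lambda c: penalized_score(c, lam)) for lam in lams}


def pareto_frontier(cands: List[Candidate]) -> List[Candidate]:
    # Keep candidates that no other candidate beats on both axes at once.
    frontier = []
    for c in cands:
        dominated = any(
            o.feature >= c.feature
            and o.cross_entropy <= c.cross_entropy
            and (o.feature > c.feature or o.cross_entropy < c.cross_entropy)
            for o in cands
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cross_entropy)


# Made-up numbers for illustration; these are not results from the paper.
CANDIDATES = [
    Candidate("fluent but weak", feature=2.0, cross_entropy=3.0),
    Candidate("balanced", feature=5.0, cross_entropy=4.5),
    Candidate("strong but garbled", feature=8.0, cross_entropy=9.0),
    Candidate("dominated", feature=4.0, cross_entropy=6.0),
]

FRONTIER = pareto_frontier(CANDIDATES)
WINNERS = best_per_weight(CANDIDATES, [0.0, 1.0, 5.0])
```

With these toy values, a weight of zero selects the highest-activation prompt regardless of fluency, while larger weights favor progressively more fluent prompts. Sweeping the weight is one simple way to populate a frontier like the one the abstract describes.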
