Language Model Inversion through End-to-End Differentiation
Kevin Yandoka Denamganaï, Kartic Subr
TL;DR
This work addresses language model inversion (LMI): given a target output, recover an input prompt that elicits that output from a frozen LM. It introduces Differentiable Language Models (DLMs), which make a frozen LM end-to-end differentiable by treating both its inputs and its outputs as sequences of distributions over tokens. DLMs combine soft embeddings, a Gumbel-Softmax gradient estimator with decoupled temperature parameters, and Teacher Forcing. The DLMI algorithm jointly optimizes prompt logits and per-token temperatures so that the LM produces the target output, achieving state-of-the-art LMI performance across prompt lengths and model sizes, with robustness to target difficulty. Beyond inversion, the approach enables gradient-based prompt optimization, soft prompt tuning, and targeted prompt engineering, with implications for adversarial auditing and interpretability. Scalability and sampling strategies are outlined as considerations for future work.
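To make the relaxation concrete, the following is a minimal sketch of how per-position prompt logits could be turned into soft embeddings with Gumbel-Softmax sampling. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy vocabulary and the embedding table are hypothetical, the decoupling of temperature parameters is not reproduced (a single temperature per prompt position is used), and only the forward pass is shown, with no gradient computation.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Sample a relaxed one-hot vector over the vocabulary from `logits`."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1);
        # u is clamped away from 0 and 1 to keep the logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(token_probs, embedding_table):
    """Expected embedding under a distribution over the vocabulary."""
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(token_probs, embedding_table))
        for d in range(dim)
    ]


def relax_prompt(prompt_logits, temperatures, embedding_table, rng):
    """Map per-position logits to one soft embedding per prompt position."""
    return [
        soft_embedding(gumbel_softmax(logits, tau, rng), embedding_table)
        for logits, tau in zip(prompt_logits, temperatures)
    ]


# Toy setup (hypothetical): vocabulary of 3 tokens, embedding dimension 2.
rng = random.Random(0)
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt_logits = [[2.0, 0.0, -1.0], [0.0, 0.0, 3.0]]  # prompt of length 2
temperatures = [1.0, 0.5]  # one temperature per prompt position (assumption)

soft_prompt = relax_prompt(prompt_logits, temperatures, embedding_table, rng)
print(soft_prompt)
```

In a full pipeline, the soft embeddings would replace the usual embedding lookup at the input of the frozen LM, and an automatic-differentiation framework would propagate the loss gradient back to `prompt_logits` and `temperatures`.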
Abstract
Despite emerging research on Language Models (LMs), few approaches analyse the invertibility of LMs. That is, given an LM and a desired target output sequence of tokens, determining which input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation problem. We first propose a simple algorithm that makes a given (frozen) LM end-to-end differentiable, and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that inversion powered by our differentiable LM (DLM) can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
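As an illustration of the formulation described above, one plausible way to write the optimisation problem is given below. The notation is assumed for this sketch and is not taken from the text: $\theta$ denotes the prompt logits, $n$ the prompt length, $m$ the target length, $V$ the vocabulary, $y = (y_1, \dots, y_m)$ the target sequence, and $p_{\mathrm{LM}}$ the output distribution of the frozen LM. The negative log-likelihood loss is likewise an assumed choice.

$$
\theta^{\star} = \arg\min_{\theta \in \mathbb{R}^{n \times |V|}} \; -\sum_{t=1}^{m} \log p_{\mathrm{LM}}\left(y_t \mid \pi_\theta,\, y_{<t}\right),
\qquad
\pi_\theta = \left(\mathrm{softmax}(\theta_1), \dots, \mathrm{softmax}(\theta_n)\right)
$$

Here $\pi_\theta$ is the prompt expressed as a sequence of $n$ distributions over tokens. Only $\theta$ is updated by gradient descent; the parameters of the LM stay fixed.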
