Language Model Inversion through End-to-End Differentiation

Kevin Yandoka Denamganaï, Kartic Subr

TL;DR

This work addresses language model inversion (LMI): given a target output, recover an input prompt that elicits that output from a frozen LM. It introduces Differentiable Language Models (DLMs), which make a frozen LM end-to-end differentiable by treating both its inputs and its outputs as distributions over tokens, using soft embeddings and a Gumbel-Softmax gradient estimator with decoupled temperature parameters, complemented by Teacher Forcing. The DLMI algorithm jointly optimizes prompt logits and per-token temperatures so that the LM produces the target output; the authors report state-of-the-art LMI performance across prompt lengths and model sizes, with robustness to target difficulty. Beyond inversion, the approach enables gradient-based prompt optimization, soft prompt tuning, and targeted prompt engineering, with implications for adversarial auditing and interpretability. Scalability and sampling strategies are outlined as considerations for future work.
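To make the "distributions over tokens" idea concrete, here is a minimal forward-pass sketch of a Gumbel-Softmax relaxation followed by a soft embedding. It is an illustration under stated assumptions, not the authors' implementation: the toy vocabulary, the embedding table, the logits, and the names `gumbel_softmax` and `soft_embed` are all hypothetical, and a real setup would run inside an autodiff framework so that gradients reach the logits and temperatures.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform; clamp u away from 0 so log(u) is finite.
        u = max(rng.random(), 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embed(probs, embedding_table):
    """Expected embedding under a token distribution (a convex mix of table rows)."""
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table)) for d in range(dim)]


rng = random.Random(0)

# Hypothetical toy setup: 4 tokens in the vocabulary, 3-dimensional embeddings.
embedding_table = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Quantities that would be learned for a prompt of length 2:
# one logit vector and one temperature per prompt position (the "decoupled" case).
prompt_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
temperatures = [1.0, 0.1]

token_dists = [gumbel_softmax(l, t, rng) for l, t in zip(prompt_logits, temperatures)]
soft_prompt = [soft_embed(p, embedding_table) for p in token_dists]
```

Each entry of `soft_prompt` is a weighted average of embedding rows rather than a single row, which is what lets a frozen LM consume a distribution instead of a discrete token. A lower temperature pushes the sampled distribution towards a one-hot vector.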

Abstract

Despite emerging research on Language Models (LMs), few approaches analyse the invertibility of LMs. That is, given an LM and a desired target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation problem. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM, and then we find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
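One plausible way to write the optimisation problem described above is sketched below. The notation is an assumption made for illustration: $N$ (prompt length), $M$ (target length) and $\mathcal{V}$ (vocabulary) are borrowed from the figure captions, $\theta$ denotes the learnable prompt logits, $y_{1:M}$ the target tokens, and the teacher-forced cross-entropy loss is a guess at the objective rather than the paper's stated equation.

```latex
\theta^{\star} \in \arg\min_{\theta \in \mathbb{R}^{N \times |\mathcal{V}|}}
  \; -\sum_{t=1}^{M} \log p_{\mathrm{LM}}\!\left( y_t \,\middle|\, \pi(\theta),\, y_{<t} \right),
\qquad
\pi(\theta)_i = \operatorname{softmax}(\theta_i), \quad i = 1, \dots, N
```

Under this reading, the LM weights stay fixed and only $\theta$ is updated by gradient descent; a discrete prompt can then be read off, for example by taking the most likely token at each position.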

Paper Structure (35 sections, 6 equations, 5 figures, 2 tables, 2 algorithms)

This paper contains 35 sections, 6 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Our abstracted view of (a) existing LMs; (b) our differentiable LM; and (c) gradients through sampling (Bengio et al., 2013).
  • Figure 2: Accuracy (LCS ratio) vs. target difficulty $k$ for SmolLM2-135M (top) and SmolLM3-3B (bottom) after 256 (left) or 2048 (right) optimization steps, with $N=80$ and $M=20$.
  • Figure 3: Accuracy (LCS ratio) vs. target difficulty $k$ for SmolLM2-135M (top) and SmolLM3-3B (bottom) after 256 (left) or 2048 (right) optimization steps, with $N=10$ and $M=20$.
  • Figure 4: Mean $\pm$ standard error of the effective temperature $\tau$ (averaged over prompt tokens when decoupled) for SmolLM2-135M over $256$ optimization steps, with $N=80$.
  • Figure 5: Violin plots (with mean and extrema), in log scale, of the maximal gradient variance (left) and maximal gradient bias (right) for SmolLM2-135M over 256 optimization steps, with $N=80$ and $M=20$. The maximum is taken over the $M\cdot |\mathcal{V}|$ learnable logits at each optimization step. D stands for decoupled learnable temperature; L stands for learnable temperature.
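
The captions for Figures 2 and 3 report accuracy as an LCS ratio without defining it here. The sketch below shows one common reading, assumed for illustration: the length of the longest common subsequence between the generated and target token sequences, divided by the target length. The paper may instead use a different normalization or a longest common substring, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(generated, target):
    """Assumed normalization: LCS length divided by the target length."""
    if not target:
        return 0.0
    return lcs_length(generated, target) / len(target)


# Toy token-id sequences: the common subsequence is [5, 2, 9], so the ratio is 3 / 4.
target = [5, 8, 2, 9]
generated = [5, 2, 7, 9]
ratio = lcs_ratio(generated, target)
```

With this definition, a ratio of 1.0 means the whole target appears, in order, within the generated sequence.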