Extracting Prompts by Inverting LLM Outputs

Collin Zhang, John X. Morris, Vitaly Shmatikov

TL;DR

This work tackles language model inversion under black-box access: inferring a hidden prompt from multiple LLM outputs. It introduces output2prompt, an encoder-decoder inversion model with a sparse encoder that operates on normal (non-adversarial) outputs, achieving high semantic fidelity and strong cross-LLM transfer with far less data than prior logits-based methods. The results show robust semantic similarity across diverse user and system prompts and datasets, and reveal the practical risk of prompt cloning without adversarial queries; the paper also discusses efficiency benefits and defenses. Overall, the approach significantly improves sample efficiency and generalizability while highlighting vulnerabilities in prompt confidentiality for deployed LLM applications.

Abstract

We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt only needs outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.

Paper Structure

This paper contains 59 sections, 4 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Overview: given outputs sampled from an LLM, our inversion model generates the prompt.
  • Figure 2: Prompt extraction quality vs. the number of LLM outputs provided to the inverter. The inverter was trained to extract prompts from the Llama-family models only; all non-blue lines measure output2prompt's ability to transfer to unseen model families.
  • Figure 3: Loss curves of the inversion model trained on 16 outputs for one epoch, 3 different methods.