Eliciting Textual Descriptions from Representations of Continuous Prompts

Dana Ramati, Daniela Gottesman, Mor Geva

TL;DR

This work proposes a new approach to interpreting continuous prompts: eliciting textual descriptions from their representations during model inference. The method often yields accurate task descriptions, which become more faithful as task performance increases.

Abstract

Continuous prompts, or "soft prompts", are a widely-adopted parameter-efficient tuning strategy for large language models, but are often less favorable due to their opaque nature. Prior attempts to interpret continuous prompts relied on projecting individual prompt tokens onto the vocabulary space. However, this approach is problematic as performant prompts can yield arbitrary or contradictory text, and it interprets prompt tokens individually. In this work, we propose a new approach to interpret continuous prompts that elicits textual descriptions from their representations during model inference. Using a Patchscopes variant (Ghandeharioun et al., 2024) called InSPEcT over various tasks, we show our method often yields accurate task descriptions which become more faithful as task performance increases. Moreover, an elaborated version of InSPEcT reveals biased features in continuous prompts, whose presence correlates with biased model predictions. Providing an effective interpretability solution, InSPEcT can be leveraged to debug unwanted properties in continuous prompts and inform developers on ways to mitigate them.
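
To make the baseline mentioned in the abstract concrete, here is a minimal sketch of projecting individual prompt tokens onto the vocabulary space. It assumes the projection is a nearest-neighbor lookup by cosine similarity against the token embedding table, which is one common choice and is not specified in the abstract. The toy vocabulary, the 3-dimensional vectors, and the names `cosine`, `project_to_vocab`, and `VOCAB_EMB` are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project_to_vocab(prompt, vocab_emb):
    """Map each soft-prompt vector to its most similar vocabulary token.

    Each prompt vector is handled on its own, which is the limitation
    the abstract points out for this style of interpretation.
    """
    return [
        max(vocab_emb, key=lambda tok: cosine(vec, vocab_emb[tok]))
        for vec in prompt
    ]

# Hypothetical toy embedding table (token -> vector); not taken from the paper.
VOCAB_EMB = {
    "entail": [1.0, 0.0, 0.0],
    "contradict": [0.0, 1.0, 0.0],
    "neutral": [0.0, 0.0, 1.0],
    "the": [1.0, 1.0, 1.0],
}

# Hypothetical 2-token continuous prompt in the same 3-dimensional space.
prompt = [[0.9, 0.1, 0.0], [0.1, 0.1, 2.0]]

tokens = project_to_vocab(prompt, VOCAB_EMB)
print(tokens)
```

Because every vector is mapped independently, the result is a bag of nearest tokens rather than a description of the prompt as a whole. InSPEcT instead patches the prompt representations into a separate inference pass that generates a task description.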

Paper Structure

This paper contains 32 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: InSPEcT interprets a continuous prompt by patching the prompt representations (top) into an inference pass that generates a task description (bottom).
  • Figure 2: Prompt interpretability as a function of task accuracy for LLaMA2. The Class Rate/ROUGE-1 scores are averaged over all the prompts within the accuracy bin. For each task and token length, the scores increase with the performance of the prompt. Results for LLaMA3 show similar trends (see the additional results section).
  • Figure 3: Differences in counts of each word group in generated outputs during training with respect to randomly-initialized prompts (epoch 0). The distributions are aggregated over 10 continuous prompts trained on SNLI (Bowman et al., 2015).
  • Figure 4: Histograms of the counts of generated biased words across different prompt bias levels. Outputs with biased words $(>0)$ show positive predictive bias, while those without $(=0)$ are unbiased on average. The x-axis is cut to $[-10, 20]$ for brevity, omitting outliers.
  • Figure 5: Prompt interpretability as a function of task accuracy for LLaMA3. The Class Rate/ROUGE-1 scores are averaged over all the prompts within the accuracy bin.
  • ...and 1 more figure