Word Importance Explains How Prompts Affect Language Model Outputs

Stefan Hackmann; Haniyeh Mahmoudian; Mark Steadman; Michael Schmidt

TL;DR

The paper addresses explainability in large language models by quantifying how individual words in a system prompt influence outputs across user prompts, using a word-importance method inspired by permutation importance. The importance of the $k$-th word is defined as $w(k)=\frac{1}{N M}\sum_{i=1}^{N}\sum_{j=1}^{M} |f(m(s,u_j)) - f(m(s_k,u_j))|$, where $s$ is the original system prompt, $s_k$ is the system prompt with the $k$-th word masked, $u_j$ is the $j$-th user input, $m$ is the language model, $N$ is the number of completions per prompt, $M$ is the number of user inputs, and $f$ is an arbitrary text-score function. The sum over $i$ runs over the $N$ completions generated for each prompt. Experiments with artificial data and SQuAD 2 questions show positive correlations between the impact of a suffix and the maximum word importance within that suffix for both GPT-3.5 Turbo and Llama2-13B, using word count, Flesch reading-ease, and topic similarity as scores. The approach is simple and model-agnostic, and it is useful for bias detection, prompt engineering, and transparent evaluation. Future directions include substitution masking, hierarchical prompt analysis, and extension to other prompt components.
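As a small worked illustration of the metric, suppose $N=1$ completion per prompt, $M=2$ user inputs, and $f$ is word count. The score values below are invented purely for the arithmetic and are not results from the paper: the original system prompt $s$ yields outputs of 10 and 12 words, and the prompt $s_k$ with word $k$ masked yields outputs of 7 and 12 words.

$$
w(k)=\frac{1}{1\cdot 2}\Big(|10-7|+|12-12|\Big)=\frac{3}{2}=1.5
$$

Under these assumed numbers, masking word $k$ shifts the word count by 1.5 words on average, and all of that shift comes from the first user input.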

Abstract

The emergence of large language models (LLMs) has revolutionized numerous applications across industries. However, their "black box" nature often hinders the understanding of how they make specific decisions, raising concerns about their transparency, reliability, and ethical use. This study presents a method to improve the explainability of LLMs by varying individual words in prompts to uncover their statistical impact on the model outputs. This approach, inspired by permutation importance for tabular data, masks each word in the system prompt and evaluates its effect on the outputs based on the available text scores aggregated over multiple user inputs. Unlike classical attention, word importance measures the impact of prompt words on arbitrarily-defined text scores, which enables decomposing the importance of words into the specific measures of interest--including bias, reading level, verbosity, etc. This procedure also enables measuring impact when attention weights are not available. To test the fidelity of this approach, we explore the effect of adding different suffixes to multiple different system prompts and comparing subsequent generations with different large language models. Results show that word importance scores are closely related to the expected suffix importances for multiple scoring functions.
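The masking procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_model` is a hypothetical deterministic stand-in for an LLM call, whitespace splitting is assumed as the word tokenizer, and the names `mask_word` and `word_importance` are invented for this example.

```python
from typing import Callable, List, Sequence

MASK = "_"  # the figure captions describe masking words with an underscore


def mask_word(system_prompt: str, k: int, mask: str = MASK) -> str:
    """Return the system prompt with its k-th whitespace-separated word masked."""
    words = system_prompt.split()
    words[k] = mask
    return " ".join(words)


def word_importance(
    system_prompt: str,
    user_inputs: Sequence[str],
    model: Callable[[str, str], str],
    score: Callable[[str], float],
    n_completions: int = 1,
) -> List[float]:
    """Mean absolute change in `score` when each system-prompt word is masked."""
    # Scores for the unmasked prompt: one per (completion i, user input j).
    baseline = [
        [score(model(system_prompt, u)) for u in user_inputs]
        for _ in range(n_completions)
    ]
    importances = []
    for k in range(len(system_prompt.split())):
        masked = mask_word(system_prompt, k)
        total = 0.0
        for i in range(n_completions):
            for j, u in enumerate(user_inputs):
                total += abs(baseline[i][j] - score(model(masked, u)))
        importances.append(total / (n_completions * len(user_inputs)))
    return importances


def toy_model(system_prompt: str, user_input: str) -> str:
    """Hypothetical stand-in for an LLM: terse only if told to answer 'briefly'."""
    if "briefly" in system_prompt.split():
        return "ok"
    return "a much longer answer to " + user_input


def word_count(text: str) -> float:
    """Text score: number of whitespace-separated words."""
    return float(len(text.split()))


scores = word_importance(
    "Answer briefly please", ["what is rain", "why"], toy_model, word_count
)
print(scores)
```

With a real model, `model` would wrap an API call and `n_completions` would be set above 1 to average over sampling noise. Any text score, such as a reading-ease or topic-similarity function, can be passed as `score` in place of `word_count`.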

Paper Structure (14 sections, 1 equation, 13 figures, 1 table, 1 algorithm)

Introduction
Related Works
Word Importance Method
Detailed Steps
Experimentation and Results
Setup
Results
Limitations and Future Directions
Conclusion
Appendix
Artificial data
System prompts
Topic Similarity
Large Result Plots

Figures (13)

Figure 1: Illustration of the word importance method. Words from the system prompt are masked with underscore. The masked system prompts, together with user inputs, are passed to an LLM and the outputs are evaluated with arbitrary text scores. The importance score for every word from the prompt with regard to the selected text score is computed by comparing these results with the results from using the original system prompt.
Figure 2: System prompt word importance evaluated by multiple scores. Each word is masked with an underscore to compute its word importance score. Using multiple scores simultaneously makes it easy to observe the multifaceted impact of each word on the model output.
Figure 3: Actual suffix importance vs. maximum importance from suffix. For each suffix, the word importance has been calculated using the scoring functions "word count", "Flesch reading-ease", and "topic similarity". The outputs from GPT-3.5 Turbo cluster into two sets: one where the suffix impact is roughly the size of the maximum word importance, and one where the maximum word importance is significantly greater.
Figure 4: Actual suffix importance vs maximum importance from suffix. For each suffix, the word importance has been calculated using scoring functions "word count", "Flesch reading-ease", and "topic similarity".
Figure 5: Actual suffix importance vs maximum importance from suffix. For each suffix, the word importance has been calculated using scoring functions "word count", "Flesch reading-ease", and "topic similarity".
...and 8 more figures
