A Regression Framework for Understanding Prompt Component Impact on LLM Performance

Andrew Lauziere, Jonathan Daugherty, Taisa Kushner

Abstract

As large language models (LLMs) continue to improve and see further integration into software systems, so too does the need to understand the conditions under which they will perform. We contribute a statistical framework for understanding the impact of specific prompt features on LLM performance. The approach extends previous explainable artificial intelligence (XAI) methods to the inspection of LLMs by fitting regression models that relate portions of the prompt to LLM evaluation scores. We apply our method to compare how two open-source models, Mistral-7B and GPT-OSS-20B, leverage the prompt to perform a simple arithmetic problem. Regression models of individual prompt portions explain 72% and 77% of the variation in model performance, respectively. We find that misinformation in the form of incorrect example query-answer pairs impedes both models from solving the arithmetic query, whereas positive examples do not. We also find significant variability in the impact of positive and negative instructions: these prompts have contradictory effects on model performance. The framework serves as a tool for decision makers in critical scenarios to gain granular insight into how the prompt influences an LLM to solve a task.
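
As a rough, hypothetical illustration of the framework (the component texts, the `score_subprompt` placeholder, and the stand-in scoring rule below are assumptions for exposition, not the paper's implementation), the following sketch enumerates every subset of prompt components, scores each resulting subprompt, encodes inclusion/exclusion as binary indicators, and fits a first-order regression against the scores:

```python
# Minimal sketch of the regression framework (hypothetical names throughout).
# A "prompt model" lists optional components; every subset forms a subprompt
# that is scored, and a linear regression relates the binary inclusion
# indicators to the observed scores.
from itertools import product

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative prompt components for the query "3+2=" (placeholders only).
components = [
    "You are a careful arithmetic assistant.",  # task-specific description
    "1+1=2",                                    # example pair (correct)
    "2+2=5",                                    # example pair (incorrect)
]
query = "3+2="


def score_subprompt(prompt: str) -> float:
    """Placeholder for the LLM scoring step (e.g., a DCPMI-style metric).

    In practice this would query the model and score its answer; here it
    returns a dummy value so the sketch runs end to end.
    """
    return float(len(prompt) % 7)  # stand-in score, not meaningful


# Enumerate every inclusion/exclusion pattern (2^k subprompts).
X, y = [], []
for mask in product([0, 1], repeat=len(components)):
    included = [c for c, m in zip(components, mask) if m]
    X.append(mask)
    y.append(score_subprompt("\n".join(included + [query])))

X = np.array(X, dtype=float)
y = np.array(y, dtype=float)

# First-order (unary) regression: one coefficient per prompt component.
model = LinearRegression().fit(X, y)
for component, coef in zip(components, model.coef_):
    print(f"{coef:+.3f}  {component}")
print(f"intercept (baseline query score): {model.intercept_:.3f}")
```

In this sketch the fitted coefficients play the role of the per-component importance estimates visualized in Figure 1, panel E, and the intercept plays the role of the baseline query score.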

Figures (4)

  • Figure 1: Overview of IAMs. First, in panel A, a prompt model describes texts of interest that could impact LLM performance on the query "3+2=" (blue). The first stratum (purple) contains two choices of task-specific description text, while five strata (red) each contain an example query-answer pair. An LLM processes and scores the full set of possible subprompts, each an input prompt containing a subset of prompt components (panel B). The inclusion (modeled as a 1) or exclusion (modeled as a 0) of components is expressed with binary variables (panel C). Pairs of encoding vectors and scores are used to fit a regression model, such as the one written in panel D. Panel E shows an example visualization of the learned model parameters, conveying each component's importance for LLM performance.
  • Figure 2: GPT-OSS-20B was more sensitive to the prompt than Mistral-7B. Left: Mistral-7B's DCPMI across all 8192 subprompts (red) showed a unimodal distribution, whereas GPT-OSS-20B's DCPMI distribution was bimodal (black). The baseline DCPMI, 1.0 for both models, stands at the center of the Mistral-7B distribution. Right: Quantile-quantile (QQ) plots for Mistral (residuals in red, line of best fit in blue) and OSS (residuals in black, line of best fit in green) from first-order regression models (see the residual-diagnostic sketch after this list). Overall, the unary model for OSS yielded more extreme residuals (y-axis values below -3 and above 3) than Mistral. The OSS residual distribution also deviated from normality, as the tails of the distribution had less mass than expected (black dots under the green line).
  • Figure 3: LASSO trades off fidelity for interpretability. LASSO penalized model size (green curves, measured as $\|\hat{\boldsymbol{\beta}}_\lambda\|_2$) with increasing $\lambda$. As models shrank, performance declined and MSE (blue curves) rose, yielding a trade-off. The grid search procedure identified an optimal $\lambda^* \approx 0.0004$ for both models. At these "elbows" of the green curves, model reduction began to slow (a minimal sketch of this penalized sweep appears after this list).
  • Figure 4: LASSO penalized interaction terms more than first-order terms in both Mistral and OSS. Model sizes, $\|\hat{\boldsymbol{\beta}}_\lambda\|_2$, were stratified by level of interaction. Blue curves show Intercept (baseline) estimates for each $\lambda$; OSS first-order estimates remained larger than the baseline whereas all prompt component parameter estimates fell below the baseline value for Mistral, suggesting that OSS used prompt components more than Mistral, relative to baseline query DCPMIs.
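
The residual diagnostics summarized in Figure 2 (right) can be reproduced in outline. The sketch below is an assumption-laden illustration on synthetic data (the design matrix, stand-in scores, and standardization choices are placeholders, not the paper's data): a first-order model is fit and its standardized residuals are compared against normal quantiles in a QQ plot.

```python
# Minimal sketch of the residual diagnostics in the spirit of Figure 2 (right),
# on synthetic data; nothing here reproduces the paper's actual experiment.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)            # synthetic inclusion indicators
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=200)   # stand-in performance scores

# First-order (unary) model: score ~ binary component indicators.
fit = LinearRegression().fit(X, y)
residuals = y - fit.predict(X)

# Standardize residuals and compare their quantiles to a normal distribution.
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
stats.probplot(z, dist="norm", plot=plt)   # QQ plot with a reference line
plt.title("Residual QQ plot, first-order model")
plt.show()
```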
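
Similarly, the penalized sweep behind Figures 3 and 4 amounts to fitting a LASSO along a grid of penalties. The sketch below reuses the same synthetic setup (again an assumption, not the paper's experiment), expands the binary design with pairwise interaction terms, and tracks the coefficient norm overall and stratified by interaction order, alongside the in-sample MSE:

```python
# Minimal sketch of a LASSO path over a penalty grid, in the spirit of
# Figures 3-4; the data, grid, and feature expansion are illustrative only.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6)).astype(float)            # synthetic inclusion indicators
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=200)   # stand-in performance scores

# Add pairwise interaction terms to the binary design.
expander = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = expander.fit_transform(X)
order = expander.powers_.sum(axis=1)   # 1 = first-order term, 2 = pairwise interaction

for lam in np.logspace(-4, -1, 10):
    fit = Lasso(alpha=lam, max_iter=50_000).fit(X_int, y)
    coef = fit.coef_
    print(
        f"lambda={lam:.4f}  "
        f"|beta|_2={np.linalg.norm(coef):.3f}  "
        f"first-order={np.linalg.norm(coef[order == 1]):.3f}  "
        f"interactions={np.linalg.norm(coef[order == 2]):.3f}  "
        f"MSE={mean_squared_error(y, fit.predict(X_int)):.3f}"
    )
```

The penalty at which the coefficient norm stops shrinking rapidly while the MSE is still low corresponds to the "elbow" described in the Figure 3 caption, and stratifying the norm by interaction order mirrors the comparison drawn in Figure 4.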