Mitigating the Influence of Distractor Tasks in LMs with Prior-Aware Decoding

Raymond Douglas, Andis Draguns, Tomáš Gavenčiak

TL;DR

The paper tackles the problem of distractor tasks in language models, including prompt injections and inverse-scaling effects, by framing LMs as products of experts and introducing Prior-Aware Decoding (PAD). PAD performs inference-time contrastive decoding through a linear logit combination $L = L_O + \alpha (L_O - L_W)$, where $L_O$ and $L_W$ are the logits obtained from two prompts (original and weakened), to bias outputs toward the intended task without retraining. Empirically, PAD improves task completion in 41 of 44 task-model combinations across 11 models and 4 task sets, with a median 40% increase in task completion at $\alpha = 2$. This work provides both a practical technique for more reliable LMs and a theoretical lens on how strong priors and distractor tasks arise, with potential implications for prompt-injection defenses and broader model elicitation strategies.
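
The logit combination in the TL;DR can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `pad_logits` and `softmax`, the three-token vocabulary, and the logit values are all hypothetical, chosen only to show how extrapolating away from the weakened prompt can flip the top token.

```python
import numpy as np

def pad_logits(logits_original, logits_weakened, alpha=2.0):
    """Combine logits as L = L_O + alpha * (L_O - L_W).

    logits_original: next-token logits under the full (original) prompt.
    logits_weakened: next-token logits under the weakened prompt.
    Both names are illustrative assumptions, not the paper's API.
    """
    lo = np.asarray(logits_original, dtype=float)
    lw = np.asarray(logits_weakened, dtype=float)
    return lo + alpha * (lo - lw)

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical toy vocabulary: token 0 = distractor continuation,
# token 1 = intended continuation, token 2 = unrelated token.
logits_o = np.array([2.0, 1.5, 0.0])  # original prompt: distractor still wins
logits_w = np.array([2.5, 0.5, 0.0])  # weakened prompt: distractor wins by more

combined = pad_logits(logits_o, logits_w, alpha=2.0)
probs = softmax(combined)

print("baseline top token:", int(np.argmax(logits_o)))
print("PAD top token:     ", int(np.argmax(combined)))
```

In this toy case the original prompt alone still prefers the distractor token, while the combined logits prefer the intended one, because the intended token is the one whose logit rises when the task description is present.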

Abstract

The broad capabilities of Language Models (LMs) can be limited by their sensitivity to distractor tasks: LMs can infer secondary tasks from the prompt in addition to the intended one, leading to unwanted outputs. For example, prompt injection attacks can cause models to deviate from explicit directives. In some 'inverse scaling' cases, this unwanted behaviour actually worsens as models scale up to at least 540B parameters. We present a theoretical framework that interprets LMs as a product of experts that combine multiple data generation processes. Based on this framework, we demonstrate prior-aware decoding (PAD), a simple contrastive inference method to reduce the influence of distractor tasks. We apply PAD to eleven models, across four datasets, and find improvements in 41 out of 44 task-model combinations, with a median increase in task completion proportion of 40%. The results suggest a promising direction for further development towards more reliable language models.
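
As a rough illustration of how the product-of-experts view connects to a linear combination of logits, consider a generic geometric mixture. The experts $p_i$ and weights $w_i$ below are generic symbols introduced here, not the paper's notation. A weighted product of distributions is a weighted sum in log space:

$$p(x) \propto \prod_i p_i(x)^{w_i} \quad\Longleftrightarrow\quad \log p(x) = \sum_i w_i \log p_i(x) + C.$$

Applying this to the PAD combination $L = L_O + \alpha (L_O - L_W) = (1 + \alpha) L_O - \alpha L_W$ and taking the softmax gives

$$p_{\mathrm{PAD}}(x) \propto \frac{p_O(x)^{1 + \alpha}}{p_W(x)^{\alpha}},$$

where $p_O$ and $p_W$ denote the next-token distributions under the original and weakened prompts. Under this reading, tokens that are likely even under the weakened prompt are down-weighted, while tokens that become more likely once the full prompt is present are amplified.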

Paper Structure

This paper contains 15 sections, 16 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of our geometric mixture model, also referred to as product of experts, and Prior-Aware Decoding in a task requiring a modification of a very common sequence.
  • Figure 2: Overview of the results of the Prior-Aware Decoding method at different values of parameter $\alpha$ over 4 experimental task sets and 11 language models. Each point refers to performance on one (task set, model) pair, showing the baseline performance of the unmodified original model ($x$-axis) vs PAD using the "truncated prompt" weakening ($y$-axis). See the results table for details.
  • Figure 3: Mean probability of correct completion depending on the extrapolation parameter $\alpha$. The plots combine accuracy at two different temperatures (0.0 and 1.0) and using two different methods (leaving out task description and adding a common system prompt). The range of $\alpha$ is wider than the plausibly practical range to illustrate the trends.
  • Figure 4: Mean probability of correct completion depending on the extrapolation parameter $\alpha$; continuation of Figure 3 for several models of the GPT-3 family used via the OpenAI API, extrapolating from the likelihoods of the top 5 predicted tokens returned by the API (note that this is a limitation of the OpenAI service).
  • Figure 5: Overview of the results of our method at different values of parameter $\alpha$. Each datapoint refers to performance on one task-model pair, showing original performance ($x$-axis) vs logit extrapolation between two different models ($y$-axis).
  • ...and 1 more figure