Mitigating the Influence of Distractor Tasks in LMs with Prior-Aware Decoding
Raymond Douglas, Andis Draguns, Tomáš Gavenčiak
TL;DR
The paper tackles the problem of distractor tasks in language models, including prompt injections and inverse-scaling effects, by framing LMs as products of experts and introducing Prior-Aware Decoding (PAD). PAD performs inference-time contrastive decoding through a linear logit combination $L = L_O + \alpha (L_O - L_W)$, where $L_O$ and $L_W$ are the logits obtained from two prompts (original and weakened), to bias outputs toward the intended task without retraining. Empirically, PAD yields robust improvements across 11 models and 4 task sets, with 41 of 44 task-model combinations showing gains and a median 40% increase in task completion at $\alpha = 2$. This work provides both a practical technique for more reliable LMs and a theoretical lens on how strong priors and distractor tasks arise, with potential implications for prompt-injection defenses and broader model elicitation strategies.
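The logit combination above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pad_logits` and `softmax` are hypothetical, the three-token vocabulary and its logit values are invented for illustration, and the sketch assumes that the weakened prompt favours the distractor answer more strongly than the original prompt does.

```python
import math


def pad_logits(logits_original, logits_weakened, alpha=2.0):
    """Combine per-token logits as L = L_O + alpha * (L_O - L_W)."""
    if len(logits_original) != len(logits_weakened):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [
        lo + alpha * (lo - lw)
        for lo, lw in zip(logits_original, logits_weakened)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary (invented values): index 0 = intended answer,
# index 1 = distractor answer, index 2 = unrelated token.
logits_o = [1.0, 1.5, 0.0]  # original prompt: distractor slightly preferred
logits_w = [0.0, 2.0, 0.0]  # weakened prompt: distractor strongly preferred

combined = pad_logits(logits_o, logits_w, alpha=2.0)
probs = softmax(combined)

print("combined logits:", combined)
print("argmax before:", logits_o.index(max(logits_o)))
print("argmax after: ", probs.index(max(probs)))
```

In this toy case the original logits rank the distractor token first, while the combined logits rank the intended token first, because the contrast term penalises tokens that the weakened prompt favours.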
Abstract
The broad capabilities of Language Models (LMs) can be limited by their sensitivity to distractor tasks: LMs can infer secondary tasks from the prompt in addition to the intended one, leading to unwanted outputs. For example, prompt injection attacks can cause models to deviate from explicit directives. In some 'inverse scaling' cases, this unwanted behaviour actually worsens as models scale up to at least 540B parameters. We present a theoretical framework that interprets LMs as a product of experts that combines multiple data generation processes. Based on this framework, we demonstrate prior-aware decoding (PAD), a simple contrastive inference method to reduce the influence of distractor tasks. We apply PAD to eleven models, across four datasets, and find improvements in 41 out of 44 task-model combinations, with a median increase in task completion proportion of 40%. The results suggest a promising direction for further development towards more reliable language models.
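The link between the linear logit rule $L = L_O + \alpha (L_O - L_W)$ and the product-of-experts view can be made explicit with a short worked identity. This is a sketch and the paper's own derivation may be framed differently. It assumes that next-token distributions are obtained by a softmax over logits, and the symbols $p_O$, $p_W$, and $p_{\mathrm{PAD}}$ are notation introduced here for the original-prompt, weakened-prompt, and combined distributions.

$$
p_{\mathrm{PAD}}(x) \;=\; \frac{\exp\big(L_O(x) + \alpha\,(L_O(x) - L_W(x))\big)}{\sum_{x'} \exp\big(L_O(x') + \alpha\,(L_O(x') - L_W(x'))\big)} \;\propto\; \frac{p_O(x)^{1+\alpha}}{p_W(x)^{\alpha}}
$$

The proportionality holds because each logit equals the corresponding log-probability up to a constant that does not depend on $x$. Under these assumptions the combined distribution is a product in which $p_O$ is raised to the power $1 + \alpha$ and $p_W$ is divided out with exponent $\alpha$, and setting $\alpha = 0$ recovers ordinary decoding from the original prompt.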
