Safe and Efficient In-Context Learning via Risk Control

Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, Eric Nalisnick

TL;DR

This work addresses safety in in-context learning by limiting how far harmful demonstrations can push performance below a safe zero-shot baseline. It introduces a distribution-free risk control (DFRC) framework combined with dynamic early exits to cap risk through a per-example threshold $\lambda$, while preserving benefits from helpful demonstrations and improving efficiency. The authors define a safe ICL predictor, an overthinking loss $\ell_{ICL}$, and a risk-transformation adaptation of Learn-Then-Test to handle non-monotonic and negative losses, offering theoretical guarantees and empirical risk control across eight tasks and four models with substantial speedups. Overall, the approach provides a principled mechanism to manage mixed-quality prompts, enabling safer and more efficient deployment of LLMs in real-world settings.

Abstract

Large language models (LLMs) demonstrate a remarkable ability to learn new tasks from a few in-context examples. However, this flexibility introduces safety concerns: LLMs can be influenced by incorrect or malicious demonstrations -- for example, if an adversary tampers with or injects harmful examples without a human supervisor noticing. This motivates principled designs in which the system itself includes built-in mechanisms to guard against such attacks. We propose a novel approach to limit the degree to which harmful demonstrations can degrade model performance. First, we define a baseline "safe" behavior for the model -- the model's performance given no in-context demonstrations (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which in-context samples can decay performance below zero-shot. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs and leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results showing that our approach can effectively control risk for harmful in-context demonstrations while simultaneously achieving substantial computational efficiency gains with helpful demonstrations.
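The dynamic early-exit step mentioned in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that every intermediate layer exposes class logits, that confidence is the maximum softmax probability, and that the model exits at the first layer whose confidence exceeds $\lambda$. The names `softmax` and `early_exit_predict` are illustrative and are not taken from the paper.

```python
import math
from typing import Iterable, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(layer_logits: Iterable[List[float]], lam: float) -> Tuple[int, int]:
    """Return (predicted class, index of the layer used for the prediction).

    layer_logits yields per-layer class logits lazily (assumption: each
    intermediate layer has a prediction head). Layers after the exit point
    are never requested, which is where the compute saving comes from.
    """
    label, depth = -1, -1
    for depth, logits in enumerate(layer_logits):
        probs = softmax(logits)
        confidence = max(probs)  # assumed confidence measure: max softmax probability
        label = probs.index(confidence)
        if confidence > lam:     # exit as soon as confidence exceeds the threshold
            break
    if depth < 0:
        raise ValueError("layer_logits must yield at least one layer")
    # If no layer was confident enough, this is the final layer's prediction.
    return label, depth
```

A lower `lam` exits earlier and saves more computation; a `lam` close to 1 recovers the full model's final-layer prediction.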

Paper Structure

This paper contains 42 sections, 6 equations, 21 figures.

Figures (21)

  • Figure 1: (a) An LLM is given in-context demonstrations of unknown quality (helpful or harmful). The model needs to infer whether to rely on the answer it obtains using the given demonstrations without knowing ahead of time if they are helpful or not. If not, it falls back to the answer it would give without seeing any demonstrations at all (zero-shot). (b) When given incorrect demonstrations, it is better to either early-exit or simply not use the given demonstrations than to use the model's final prediction -- staying in the "safe" performance range between zero-shot and correct demonstrations.
  • Figure 2: With an early-exit LLM, we execute every layer until our confidence exceeds the $\lambda$ threshold, after which we directly make a prediction from the intermediate layer.
  • Figure 3: Some choices of $\lambda$ thresholds can both attain performance gains from correct demonstrations and control overthinking from incorrect demonstrations. The highlighted regions show where $\lambda$ values exist such that we lose no more than 5% of the accuracy gains from correct demonstrations while still doing better than the full model given incorrect demonstrations.
  • Figure 4: We show the distribution of our ICL loss on the TweetEval-Hate dataset with a 50% mix of correct and incorrect demonstrations. There are a significant number of negative loss values, which the loss-clipping approach sets to 0. Our risk transformation approach enables us to preserve the original underlying loss distribution.
  • Figure 5: Empirical risk vs the user-specified risk level $\epsilon$ using our safe ICL model and $\ell_{\text{ICL}}$ loss over a set of mixed correct and incorrect demonstrations. Aligning with the theoretical guarantees, the risk is controlled across all models and tasks. Shaded regions correspond to one standard error over 100 experiments and are included on all plots.
  • ...and 16 more figures
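
The captions for Figures 4 and 5 refer to choosing $\lambda$ so that risk stays below a user-specified level $\epsilon$ even when the loss takes negative values. The sketch below shows a generic Learn-Then-Test style calibration with Hoeffding p-values and fixed-sequence testing. It rests on several assumptions that are not taken from the paper: losses lie in $[-1, 1]$, they are mapped to $[0, 1]$ by the affine shift $(\ell + 1)/2$ (a simple stand-in, not necessarily the authors' risk transformation), and the caller orders the candidate thresholds from the one believed safest onward. The function names and the `delta` error level are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def hoeffding_p_value(mean_loss: float, n: int, eps: float) -> float:
    """Hoeffding p-value for H0: true risk > eps, valid for losses in [0, 1]."""
    gap = max(eps - mean_loss, 0.0)
    return math.exp(-2.0 * n * gap * gap)


def calibrate_threshold(
    losses_by_lambda: Sequence[Sequence[float]],
    lambdas: Sequence[float],
    eps: float,
    delta: float,
) -> Optional[float]:
    """Fixed-sequence testing over candidate thresholds.

    losses_by_lambda[i] holds calibration losses in [-1, 1] (assumption)
    observed when running with lambdas[i]. Candidates are tested in the
    given order; testing stops at the first one that cannot be certified.
    Returns the last certified threshold, or None if none is certified.
    """
    eps01 = (eps + 1.0) / 2.0  # shift the target level along with the losses
    chosen: Optional[float] = None
    for lam, losses in zip(lambdas, losses_by_lambda):
        shifted: List[float] = [(l + 1.0) / 2.0 for l in losses]  # map to [0, 1]
        mean = sum(shifted) / len(shifted)
        if hoeffding_p_value(mean, len(shifted), eps01) > delta:
            break  # cannot reject "risk > eps" here, so stop the sequence
        chosen = lam
    return chosen
```

Shifting rather than clipping keeps the negative part of the loss distribution, so inputs where early exit helps can offset inputs where it hurts in the empirical mean. Any threshold returned by this procedure has risk at most `eps` with probability at least `1 - delta` over the calibration sample, under the stated boundedness assumption.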