Online Selective Generation with Adversarial Bandit Feedback

Minjae Lee, Yoonjae Jung, Sangdon Park

TL;DR

This work tackles online selective generation under partial feedback in adversarial environments by casting abstention thresholds as bandit arms and proving a regret-to-FDR conversion lemma that translates regret bounds into FDR guarantees. It then extends Exp3-IX with a feedback unlocking mechanism (ExSUL) to exploit partial feedback more effectively, achieving regret $\mathrm{Reg}_T = \mathcal{O}(\sqrt{T\ln|\mathcal{H}|})$ and a corresponding sublinear FDR risk $\mathcal{R}_T^{\mathrm{FDR}}$. Empirical evaluation across stochastic, distribution-shift, and interactive settings demonstrates that ExSUL can control the empirical FDR at a desired level $\alpha$ while maintaining reasonable selection efficiency, matching or outperforming baselines in various setups. The framework thus provides theoretical FDR guarantees in non-stochastic, online, and partially observable environments, with practical implications for reducing hallucinations in real-time language generation.

Abstract

Large language generative models increasingly interact with humans, while their falsified responses raise concerns. To mitigate this hallucination effect, selectively abstaining from answering, called selective generation, provides an effective way for generators to control hallucination when they are uncertain about their answers. However, as selective generators interact under adversarial environments and receive partial feedback from users on selected generations (e.g., thumbs up or down on the selected answer), learning methods for selective generation under such practical setups are crucial but currently missing. To address this limitation, we propose an online learning algorithm for selective generation with partial feedback under an adaptive adversary. In particular, we re-purpose an adversarial bandit algorithm to design an online selective generation method with a controllable false discovery rate (FDR), which measures the rate of hallucination. The key building blocks include a novel conversion lemma from the regret of any bandit algorithm to the FDR, and the exploitation of a unique structure of selective generation to reuse partial feedback, which we call feedback unlocking. We empirically evaluate the efficacy of the proposed online selective generation algorithm with partial feedback over diverse learning environments, demonstrating its ability to control the FDR while maintaining reasonable selection efficiency, i.e., the ratio of non-abstaining answers, compared to baselines.

Paper Structure

This paper contains 56 sections, 9 theorems, 86 equations, 29 figures, 2 tables, 6 algorithms.

Key Result

Lemma 1

Let $T \in \mathbb{N}$ and $\alpha \in (0, 1)$. For any $(\mathbf{x}_t, \mathbf{y}_t)$ sequences, leading to any loss sequences $\ell_t$ of (eq:loss-sg), we have where the last equality holds if we take $\lambda = 1 / T^{1/4}$.

Figures (29)

  • Figure 1: Qualitative examples from the interactive dialog simulation. They show that the proposed method, ExSUL, controls the hallucination rate, as measured by the FDR, by abstaining from answering under a practical online setup with partial feedback. See Section (sec:exp:interactive) for details.
  • Figure 2: An example of our proposed framework for online selective generation. At each step $t$, (1) the user provides an input $\mathbf{x}_t$, (2) the learner selects an arm $\tau_t \sim p_t$, (3-4) the learner selectively generates $\hat{S}_t(\mathbf{x}_t; \tau_t)$, (5-6) the user provides partial feedback $e_t$, (7) the loss $\ell_t(\tau_t)$ is computed from $e_t$, and (8) the learner updates $p_t$ using the loss estimator $\tilde{\ell}_t$. Note that the user can be modeled as an adversary.
  • Figure 3: Comparison of selective generation methods under a stochastic environment with LLaMA3.1-8B-Instruct as a generator on TriviaQA ($T=30\mathrm{K}, \alpha=0.08$). The violin plots are drawn with randomly chosen $30\mathrm{K}$ samples over $100$ random trials.
  • Figure 4: Comparison of selective generation methods under a single distribution-shift environment with GPT-3.5-turbo as a generator ($T = 30\mathrm{K}, \alpha = 0.1$), from TriviaQA to NQ. The violin plots are drawn with randomly chosen $30\mathrm{K}$ samples over $100$ random trials.
  • Figure 5: Comparison of selective generation methods under an alternating distribution-shift environment with GPT-3.5-turbo as a generator ($T=30\mathrm{K}, \alpha=0.1$), alternating between TriviaQA and NQ, starting with TriviaQA. The violin plots are drawn with randomly chosen $30\mathrm{K}$ samples over $100$ random trials.
  • ...and 24 more figures
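
The interaction loop in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration of plain Exp3-IX over abstention thresholds, not the paper's ExSUL: feedback unlocking is omitted, and the loss used here (1 for a shown wrong answer, 0 for a shown correct answer, a fixed 0.5 for abstaining) is an assumed toy stand-in for the loss in (eq:loss-sg), which this page does not reproduce. The names `run_exp3_ix`, `scores`, `correct`, and `thresholds` are hypothetical.

```python
import math
import random


def run_exp3_ix(scores, correct, thresholds, eta, gamma, seed=0):
    """Exp3-IX over abstention thresholds (illustrative sketch only).

    scores[t]:  hypothetical confidence score of the generator's answer at step t
    correct[t]: user feedback (True = thumbs up); read only when the answer is shown
    Returns the arm distribution used at the last step, the number of shown
    answers, and the number of shown answers that were wrong.
    """
    rng = random.Random(seed)
    k = len(thresholds)
    cum_est = [0.0] * k            # cumulative loss estimates, one per arm
    p = [1.0 / k] * k
    shown = wrong = 0
    for s, ok in zip(scores, correct):
        # p_t: exponential weights on the estimated cumulative losses
        base = min(cum_est)        # shift exponents for numerical stability
        w = [math.exp(-eta * (c - base)) for c in cum_est]
        total = sum(w)
        p = [x / total for x in w]
        # (2) sample a threshold arm tau_t ~ p_t
        i = rng.choices(range(k), weights=p)[0]
        # (3-4) answer only if the score clears the threshold, else abstain
        if s >= thresholds[i]:
            shown += 1
            wrong += 0 if ok else 1
            # (5-7) partial feedback exists only for shown answers
            loss = 0.0 if ok else 1.0   # ASSUMED toy loss, not (eq:loss-sg)
        else:
            loss = 0.5                  # ASSUMED fixed cost of abstaining
        # (8) implicit-exploration estimate: only the played arm is updated
        cum_est[i] += loss / (p[i] + gamma)
    return p, shown, wrong
```

For instance, with two arms where one always answers and one always abstains, and a generator that is always wrong, the distribution should concentrate on the abstaining arm. The empirical FDR of a run can then be read off as `wrong / shown` whenever `shown > 0`.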

Theorems & Definitions (14)

  • Lemma 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 2
  • proof
  • Corollary 1
  • proof
  • Remark
  • Theorem 4
  • ...and 4 more