Online Selective Generation with Adversarial Bandit Feedback
Minjae Lee, Yoonjae Jung, Sangdon Park
TL;DR
This work tackles online selective generation under partial feedback in adversarial environments by casting abstention thresholds as bandit arms and proving a Regret-to-FDR conversion lemma that translates the regret bound of any bandit algorithm into an FDR guarantee. It then extends Exp3-IX with a feedback unlocking mechanism (ExSUL) that reuses partial feedback more effectively, achieving $Reg_T = \mathcal{O}(\sqrt{T\ln|\mathcal{H}|})$ and a corresponding sublinear FDR risk $\mathcal{R}_T^{\textbf{FDR}}$. Empirical evaluation across stochastic, distribution-shift, and interactive settings shows that ExSUL controls the empirical FDR at a desired level $\alpha$ while maintaining reasonable selection efficiency, matching or outperforming baselines in various setups. The framework thus provides theoretical FDR guarantees for selective generation in non-stochastic, online, and partially observable environments, with practical implications for reducing hallucinations in real-time language generation.
Abstract
Large language generative models increasingly interact with humans, but their falsified responses raise concerns. To mitigate this hallucination effect, selectively abstaining from answering, called selective generation, provides an effective way for generators to control hallucination when they are uncertain about their answers. However, selective generators operate in adversarial environments and receive only partial feedback from users on the selected generation (e.g., thumbs up or down on the selected answer), and learning methods for selective generation under such practical setups are crucial but currently missing. To address this limitation, we propose an online learning algorithm for selective generation with partial feedback under an adaptive adversary. In particular, we re-purpose an adversarial bandit algorithm to design an online selective generation method with a controllable false discovery rate (FDR), which measures the rate of hallucination. The key building blocks are a novel conversion lemma from the regret of any bandit algorithm to the FDR, and the exploitation of a unique structure of selective generation to reuse partial feedback, which we call feedback unlocking. We empirically evaluate the proposed online selective generation algorithm with partial feedback over diverse learning environments, demonstrating its ability to control the FDR while maintaining reasonable selection efficiency, i.e., the ratio of non-abstaining answers, compared to baselines.
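
To make the "thresholds as bandit arms" view concrete, the following is a minimal sketch of plain Exp3-IX run over a grid of abstention thresholds. It is not the paper's ExSUL: feedback unlocking is omitted, and the loss used here (1 for a wrong answer, 0 for a correct answer, $\alpha$ for abstaining) is a hypothetical placeholder chosen only to keep the example self-contained. The function name `exp3_ix_selective`, the confidence `scores`, and the synthetic data are likewise assumptions for illustration. The learning rate and implicit-exploration parameter follow the standard Exp3-IX tuning for a known horizon $T$.

```python
import math
import random


def exp3_ix_selective(scores, correct, thresholds, alpha, seed=0):
    """Run plain Exp3-IX with one arm per abstention threshold.

    Returns (empirical FDR, selection efficiency) over the whole run.
    Illustrative only: the loss below is a placeholder, not the paper's.
    """
    if not scores or not thresholds:
        raise ValueError("need at least one round and one threshold")
    rng = random.Random(seed)
    T, K = len(scores), len(thresholds)
    eta = math.sqrt(2.0 * math.log(K) / (K * T))  # Exp3-IX learning rate
    gamma = eta / 2.0                              # implicit-exploration term
    cum_loss = [0.0] * K                           # cumulative loss estimates
    answered = errors = 0
    for t in range(T):
        # Exponential weights; shift by the minimum for numerical stability.
        base = min(cum_loss)
        weights = [math.exp(-eta * (c - base)) for c in cum_loss]
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = rng.choices(range(K), weights=probs)[0]
        if scores[t] >= thresholds[arm]:
            # Answer: user feedback on this answer is revealed.
            answered += 1
            wrong = not correct[t]
            errors += int(wrong)
            loss = 1.0 if wrong else 0.0
        else:
            # Abstain: no feedback; assumed fixed cost of abstaining.
            loss = alpha
        # Implicit exploration: bias the importance weight by gamma.
        cum_loss[arm] += loss / (probs[arm] + gamma)
    fdr = errors / answered if answered else 0.0
    return fdr, answered / T


if __name__ == "__main__":
    data_rng = random.Random(1)
    scores = [data_rng.random() for _ in range(2000)]
    # Synthetic user: higher-confidence answers are more likely correct.
    correct = [data_rng.random() < s for s in scores]
    grid = [i / 10 for i in range(10)]
    print(exp3_ix_selective(scores, correct, grid, alpha=0.1))
```

The returned pair corresponds to the two quantities discussed in the abstract: the fraction of wrong answers among non-abstaining rounds (empirical FDR) and the fraction of non-abstaining rounds (selection efficiency). Only the played arm is updated in this sketch; reusing a single piece of feedback across several thresholds is the part that feedback unlocking adds.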
