Online Learning from Strategic Human Feedback in LLM Fine-Tuning
Shugang Hao, Lingjie Duan
TL;DR
This work addresses the risk of strategic misreporting in online reinforcement learning from human feedback for LLM fine-tuning. It models the interaction as a dynamic Bayesian game and proposes a non-monetary Online Weighted Aggregation Mechanism that dynamically reweights human labelers based on feedback accuracy, ensuring truthful reporting and a sublinear regret bound of $\mathcal{O}(T^{1/2})$. The mechanism outperforms standard average and median aggregation schemes, which suffer from non-vanishing regret. Theoretical guarantees, together with simulations on real-world-like data, demonstrate improved alignment of LLM outputs with diverse human preferences while mitigating long-term manipulation incentives.
Abstract
Reinforcement learning from human feedback (RLHF) has become an essential step in fine-tuning large language models (LLMs) to align them with human preferences. However, human labelers are self-interested and have diverse preferences, so they may strategically misreport their online feedback to steer the system's aggregation towards their own preferences. Current practice simply averages the labelers' feedback in each time slot and fails to identify the most accurate human labeler, which leads to linear regret $\mathcal{O}(T)$ over $T$ time slots. To the best of our knowledge, we are the first to study online learning mechanisms against strategic human labelers in the LLM fine-tuning process. We formulate a new dynamic Bayesian game and dynamically adjust the human labelers' weights in the preference aggregation, ensuring truthful feedback and sublinear regret $\mathcal{O}(T^{1/2})$. Simulation results show that our mechanism outperforms the existing benchmark schemes.
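To make the idea of accuracy-based reweighting concrete, the following is a minimal sketch that uses the textbook exponential-weights (Hedge) update as a stand-in. It is not the mechanism proposed in this paper: the function names, the squared-error loss, the learning rate, and the assumption that a ground-truth signal in $[0, 1]$ is revealed after every time slot are all illustrative choices, and this generic update carries no truthfulness guarantee against strategic labelers.

```python
import math
import random


def weighted_aggregate(reports, weights):
    """Return the weight-normalized average of the labelers' reports."""
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, reports)) / total


def hedge_update(weights, reports, truth, eta):
    """Shrink each labeler's weight exponentially in its squared error.

    Assumption: `truth` is observed after the round, which is a
    simplification made only for this illustration.
    """
    return [w * math.exp(-eta * (r - truth) ** 2)
            for w, r in zip(weights, reports)]


def simulate(noise_levels, horizon, seed=0):
    """Run `horizon` rounds with non-strategic, noisy labelers.

    noise_levels[i] is the (hypothetical) standard deviation of
    labeler i's reporting noise. Returns the final normalized weights.
    """
    rng = random.Random(seed)
    n = len(noise_levels)
    weights = [1.0] * n
    # Standard Hedge learning rate for losses bounded in [0, 1].
    eta = math.sqrt(8 * math.log(n) / horizon)
    for _ in range(horizon):
        truth = rng.random()
        # Each report is the truth plus Gaussian noise, clipped to [0, 1].
        reports = [min(1.0, max(0.0, truth + rng.gauss(0.0, s)))
                   for s in noise_levels]
        # The aggregate is what the system would act on in this round.
        _estimate = weighted_aggregate(reports, weights)
        weights = hedge_update(weights, reports, truth, eta)
    total = sum(weights)
    return [w / total for w in weights]
```

Calling `simulate([0.05, 0.3, 0.5], horizon=2000)` returns normalized weights that concentrate on the least noisy labeler, whereas a plain average would keep every labeler at weight $1/3$ regardless of accuracy.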
