The Real, the Better: Aligning Large Language Models with Online Human Behaviors

Guanying Jiang, Lingyong Yan, Haibo Shi, Dawei Yin

TL;DR

This work introduces RLHB, a framework to align LLMs with real-time online human behaviors by treating the problem as adversarial learning between a generator and a behavior-conditioned discriminator. By modeling user behaviors as natural-language signals and using a multi-model, jointly trained setup, RLHB enables continuous online adaptation without relying solely on offline annotations. Experimental results show that online-behavior alignment can achieve gains comparable to offline preference methods and, in some setups, augment RLHF performance, while providing practical benefits in dynamic environments. The approach offers a scalable path for deploying LLMs that remain aligned with evolving user preferences in real-world online platforms, albeit with considerations around data quality, sample efficiency, and training stability.

Abstract

Large language model (LLM) alignment is widely used and studied to prevent LLMs from producing unhelpful or harmful responses. However, the lengthy training process and predefined preference bias hinder adaptation to diverse online human preferences. To this end, this paper proposes an alignment framework, called Reinforcement Learning with Human Behavior (RLHB), which aligns LLMs by directly leveraging real online human behaviors. Within a generative adversarial framework, the generator is trained to respond in a way that follows the expected human behavior, while the discriminator tries to verify whether a triplet of query, response, and human behavior comes from the real online environment. Modeling behaviors in natural-language form and training the models jointly enable active and sustainable online alignment. Experimental results from both human and automatic evaluations confirm the effectiveness of the proposed method.
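The adversarial setup described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the names `Triplet`, `to_text`, `discriminator_loss`, and `generator_reward` are hypothetical, the discriminator is assumed to be any callable that maps a serialized triplet to a probability of being real, and the binary cross-entropy objective is an assumption, since the abstract does not specify the loss.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one (query, response, behavior) sample."""
    query: str
    response: str
    behavior: str  # behavior expressed in natural language, e.g. "like"


def to_text(t: Triplet) -> str:
    """Serialize a triplet into one string for a text-based discriminator."""
    return f"Query: {t.query}\nResponse: {t.response}\nBehavior: {t.behavior}"


def discriminator_loss(
    d: Callable[[str], float],
    real: List[Triplet],
    fake: List[Triplet],
    eps: float = 1e-7,
) -> float:
    """Assumed binary cross-entropy: real triplets are labeled 1, generated ones 0."""
    def clip(p: float) -> float:
        return min(max(p, eps), 1.0 - eps)

    total = 0.0
    for t in real:
        total -= math.log(clip(d(to_text(t))))
    for t in fake:
        total -= math.log(1.0 - clip(d(to_text(t))))
    return total / (len(real) + len(fake))


def generator_reward(d: Callable[[str], float], t: Triplet) -> float:
    """Assumed reward for the generator: the discriminator's belief that t is real."""
    return d(to_text(t))


# Toy usage with a stand-in discriminator that cannot tell real from generated.
real = [Triplet("weather in Beijing", "Sunny today.", "like")]
fake = [Triplet("weather in Beijing", "I am not sure.", "like")]
undecided = lambda text: 0.5
loss = discriminator_loss(undecided, real, fake)
reward = generator_reward(undecided, fake[0])
```

In this sketch the generated response is paired with the behavior it was conditioned on, so the discriminator judges whether that response would plausibly have elicited that behavior online. A real system would replace `undecided` with a trained model.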

Paper Structure (31 sections, 12 equations, 6 figures)

Figures (6)

  • Figure 1: Illustration of collecting online human behaviors in Baidu Search. When a user enters a search query, the answer generated by the LLM can appear at the forefront of the search results. Then, the user can interact with the system through various behaviors, such as clicking the contents, giving a like or dislike, or changing the query.
  • Figure 2: The training process of RLHB.
  • Figure 3: GPT-4 and Human Evaluations. Although the results differ to varying degrees, most confirm the feasibility of the two questions posed in the paper, especially for RLHB and RLHF + RLHB.
  • Figure 4: Win Rates and Mean Rewards for RLHF, RLHBC, and RLHF + RLHBC. Compared with the CM of the RLHBC models, the RM of RLHF guides the model toward preference learning and convergence much more easily.
  • Figure 5: Discriminator Loss, Discriminator Rewards, and Mean Returns for RLHB and RLHF + RLHB. Unlike in the previous setting, the rewards in RLHB do not keep growing but eventually converge to a confusion state close to 0.5. This means the policy's generations are already good enough that the discriminator cannot tell real from fake, although this instability may lower the expected returns.
  • ...and 1 more figures
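
The confusion state near 0.5 noted in the Figure 5 caption matches the standard result for generative adversarial training. The derivation below is a sketch under the assumption that the RLHB discriminator is trained with the usual binary cross-entropy objective, which this summary does not state. The symbols are introduced here for illustration: $x = (q, r, b)$ is a triplet of query, response, and behavior, $p_{\text{real}}$ is the distribution of online triplets, and $p_G$ is the distribution induced by the generator.

```latex
% Assumed discriminator objective (standard GAN form):
\max_D \; \mathbb{E}_{x \sim p_{\text{real}}}\big[\log D(x)\big]
      + \mathbb{E}_{x \sim p_G}\big[\log\big(1 - D(x)\big)\big]

% Pointwise maximization over D(x) gives the optimal discriminator:
D^{*}(x) = \frac{p_{\text{real}}(x)}{p_{\text{real}}(x) + p_G(x)}

% If the generator matches the online distribution, p_G = p_{\text{real}}, then
D^{*}(x) = \tfrac{1}{2} \quad \text{for all } x
```

Under this assumption, a discriminator output that settles around 0.5 indicates that generated triplets have become hard to distinguish from real ones, which is consistent with the caption's reading.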