ReMoDetect: Reward Models Recognize Aligned LLM's Generations

Hyunseok Lee, Jihoon Tack, Jinwoo Shin

TL;DR

Recent powerful LLMs share a common feature: alignment training, i.e., training LLMs to generate human-preferable texts. This paper shows that a reward model can therefore detect their generations, and proposes two training schemes to further improve the reward model's detection ability.

Abstract

The remarkable capabilities and easy accessibility of large language models (LLMs) have significantly increased societal risks (e.g., fake news generation), necessitating the development of LLM-generated text (LGT) detection methods for safe usage. However, detecting LGTs is challenging because the vast number of LLMs makes it impractical to account for each one individually; hence, it is crucial to identify characteristics shared by these models. In this paper, we draw attention to a common feature of recent powerful LLMs, namely alignment training, i.e., training LLMs to generate human-preferable texts. Our key finding is that because these aligned LLMs are trained to maximize human preferences, they generate texts with even higher estimated preferences than human-written texts; thus, such texts are easily detected using a reward model (i.e., an LLM trained to model the human preference distribution). Based on this finding, we propose two training schemes to further improve the detection ability of the reward model: (i) continual preference fine-tuning, which makes the reward model prefer aligned LGTs even further, and (ii) reward modeling of Human/LLM mixed texts (human-written texts rephrased using aligned LLMs), which serve as a median-preference text corpus between LGTs and human-written texts and help the model learn the decision boundary better. We provide an extensive evaluation covering six text domains and twelve aligned LLMs, where our method demonstrates state-of-the-art results. Code is available at https://github.com/hyunseoklee-ai/ReMoDetect.
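The detection idea in the abstract can be illustrated with a minimal sketch: treat the reward model's scalar score as the detection score and measure how well it separates LLM-generated from human-written texts with AUROC. The reward values below are hypothetical numbers invented for illustration, and the helper names (`auroc`, `is_llm_generated`) and the threshold are assumptions, not part of the paper's released code.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of AUROC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def is_llm_generated(reward, threshold):
    """Flag a text as LLM-generated when its reward exceeds a chosen threshold."""
    return reward > threshold


# Hypothetical reward scores (illustration only, not results from the paper).
llm_rewards = [2.1, 1.7, 0.9, 2.5]      # texts written by an aligned LLM
human_rewards = [0.3, 1.0, -0.4, 0.8]   # human-written texts

score = auroc(llm_rewards, human_rewards)
flags = [is_llm_generated(r, threshold=1.2) for r in llm_rewards + human_rewards]
```

In this toy setup a higher reward means "more likely LLM-generated", which mirrors the key finding that aligned LLMs produce texts with higher estimated preference than human-written texts.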

Paper Structure

This paper contains 27 sections, 4 equations, 10 figures, and 16 tables.

Figures (10)

  • Figure 1: AUROC (%) of LLM-generated text detection methods on WritingPrompts from the Fast-DetectGPT benchmark, where GPT4 is used for text generation. 'Reward model' indicates detection using the reward score of the pre-trained reward model. Bold denotes the best result.
  • Figure 2: Overview of Reward Model based LLM Generated Text Detection (ReMoDetect): We continually fine-tune the reward model $r_\phi$ to prefer aligned LLM-generated responses $y_\mathtt{LM}$ even further while preventing overfitting by using the replay technique: $(x_\mathtt{buf},y_\mathtt{buf})$ is the replay buffer and $r_{\phi_0}$ is the initial reward model. Moreover, we generate a human/LLM mixed text $y_\mathtt{MIX}$ by partially rephrasing the human response $y_\mathtt{HU}$ using the aligned LLM. The mixed text serves as median preference data between $y_\mathtt{LM}$ and $y_\mathtt{HU}$, i.e., $y_{\mathtt{LM}} \succ y_{\mathtt{MIX}} \succ y_{\mathtt{HU}} \mid x$, to improve the reward model's detection ability.
  • Figure 3: Predicted reward distribution of human-written texts and LGTs on three different domains, including (a) Essay, (b) WritingPrompts-small, and (c) PubMed. We use the reward model from OpenAssistant [andrew2023openassistant]. 'Machine' denotes GPT4 Turbo and Claude3 Opus generated texts.
  • Figure 4: Predicted reward distribution of human-written texts and LGTs on three different reward models (RMs), including (a) Gemma 2B, (b) Gemma 7B, and (c) Llama3 8B. 'Machine' denotes GPT4 Turbo and Claude3 Opus generated texts. We use WritingPrompts-small as the text domain.
  • Figure 5: Predicted reward distribution of human-written texts and LGTs (a) 'Before' and (b) 'After' training the reward model with Eq. (\ref{eq:ours}). 'Machine' denotes GPT4-Turbo generated texts on the Essay domain.
  • ...and 5 more figures
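
The training setup described in the Figure 2 caption can be sketched as a loss over scalar rewards. This is a rough illustration only: it assumes a Bradley-Terry pairwise loss for each preference pair in the ordering $y_\mathtt{LM} \succ y_\mathtt{MIX} \succ y_\mathtt{HU}$ and a mean squared-error penalty that keeps rewards on the replay buffer close to those of the initial reward model. The function names, the squared-error choice, and the weights `lam`, `beta1`, and `beta2` are hypothetical; the paper's exact objective is not reproduced in this section.

```python
import math


def pairwise_loss(r_preferred, r_rejected):
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_preferred - r_rejected))."""
    diff = r_preferred - r_rejected
    # Numerically stable softplus(-diff).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def sketch_loss(r_lm, r_mix, r_hu, r_buf_now, r_buf_init,
                lam=1.0, beta1=1.0, beta2=1.0):
    """Ranking terms for LM > MIX > HU plus a replay penalty (assumed form)."""
    ranking = (
        pairwise_loss(r_lm, r_hu)               # LLM text preferred over human text
        + beta1 * pairwise_loss(r_mix, r_hu)    # mixed text preferred over human text
        + beta2 * pairwise_loss(r_lm, r_mix)    # LLM text preferred over mixed text
    )
    # Replay penalty: current rewards on buffer samples should stay near the
    # initial reward model's rewards (squared error is an assumption here).
    replay = sum((a - b) ** 2 for a, b in zip(r_buf_now, r_buf_init)) / len(r_buf_now)
    return ranking + lam * replay


# Rewards that respect the ordering give a lower loss than rewards that violate it.
good = sketch_loss(2.0, 1.0, 0.0, [0.5], [0.5])
bad = sketch_loss(0.0, 1.0, 2.0, [0.5], [0.5])
```

In an actual training loop the rewards would be outputs of $r_\phi$ and $r_{\phi_0}$ and the loss would be minimized with a gradient-based optimizer; plain floats are used here to keep the sketch self-contained.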