HLPD: Aligning LLMs to Human Language Preference for Machine-Revised Text Detection

Fangqi Dai, Xingjian Jiang, Zizhuang Deng

TL;DR

The paper tackles the challenge of detecting texts revised or generated by advanced LLMs, especially in black-box settings where model internals are unknown. It introduces Human Language Preference Detection (HLPD), which aligns the detector’s scoring model to human writing via Human Language Preference Optimization (HLPO) and uses Human Language Preference Conditional Probability Curvature (HLP-CPC) for detection. HLPO trains the scorer to prefer human-written text over machine-revised text, enhancing sensitivity to human-like style and improving robustness across multi-task revisions and languages. Empirical results show substantial AUROC gains over state-of-the-art baselines, strong robustness to adversarial revisions, and favorable efficiency, with additional analysis confirming the value of the human-style alignment and its potential for downstream attacks. The work demonstrates a practical, black-box detector framework capable of handling diverse revision and generation scenarios and highlights limitations around generalization to unseen domains and very short texts.
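
As a rough illustration of what a conditional probability curvature score looks like, the sketch below computes the analytic statistic popularized by Fast-DetectGPT: the log-likelihood of the observed tokens, centered by its expectation under the scoring model and scaled by its standard deviation. It is an assumption that HLP-CPC has this same shape with the HLPO-aligned scorer supplying the logits; this summary does not give the formula. The function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def conditional_probability_curvature(token_ids, logits_per_position):
    """Normalized gap between the observed log-likelihood and its expectation.

    token_ids[i] is the observed token at position i, and
    logits_per_position[i] holds the scoring model's logits for that position.
    Assumption: sampling and scoring distributions are the same model.
    """
    log_likelihood = 0.0
    mean_total = 0.0
    var_total = 0.0
    for tok, logits in zip(token_ids, logits_per_position):
        log_probs = log_softmax(logits)
        probs = [math.exp(lp) for lp in log_probs]
        # Expected log-probability and its variance under the model itself.
        mean = sum(p * lp for p, lp in zip(probs, log_probs))
        second_moment = sum(p * lp * lp for p, lp in zip(probs, log_probs))
        log_likelihood += log_probs[tok]
        mean_total += mean
        var_total += second_moment - mean * mean
    if var_total <= 1e-12:
        # Degenerate case (e.g. uniform logits): no spread to normalize by.
        return 0.0
    return (log_likelihood - mean_total) / math.sqrt(var_total)


# Toy logits over a 3-token vocabulary, repeated for 3 positions (made-up values).
toy_logits = [[2.0, 0.0, 0.0]] * 3
likely = conditional_probability_curvature([0, 0, 0], toy_logits)
unlikely = conditional_probability_curvature([1, 1, 1], toy_logits)
print(likely, unlikely)
```

Text whose tokens the scorer finds more probable than average gets a positive score, and text it finds less probable gets a negative one. Which side of a threshold counts as "machine-revised" after alignment to human style is not stated in this summary, so the decision rule would need to be calibrated on labeled data.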

Abstract

To prevent misinformation and social issues arising from trustworthy-looking content generated by LLMs, it is crucial to develop efficient and reliable methods for identifying the source of texts. Previous approaches have demonstrated exceptional performance in detecting texts fully generated by LLMs. However, these methods struggle when confronting more advanced LLM output or text with adversarial multi-task machine revision, especially in the black-box setting, where the generating model is unknown. To address this challenge, grounded in the hypothesis that human writing possesses distinctive stylistic patterns, we propose Human Language Preference Detection (HLPD). HLPD employs a reward-based alignment process, Human Language Preference Optimization (HLPO), to shift the scoring model's token distribution toward human-like writing, making the model more sensitive to human writing and thereby improving the identification of machine-revised text. We test HLPD in an adversarial multi-task evaluation framework that leverages a five-dimensional prompt generator and multiple advanced LLMs to create diverse revision scenarios. When detecting texts revised by GPT-series models, HLPD achieves a 15.11% relative improvement in AUROC over ImBD and surpasses Fast-DetectGPT by 45.56%. When evaluated on texts generated by advanced LLMs, HLPD achieves the highest average AUROC, exceeding ImBD by 5.53% and Fast-DetectGPT by 34.14%. Code will be made available at https://github.com/dfq2021/HLPD.
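
To make the idea of preference alignment concrete, here is a minimal sketch of a pairwise loss that rewards the scoring model for assigning relatively more probability to the human-written text than to its machine-revised counterpart. It assumes a DPO-style objective with a frozen reference model; the abstract only says that HLPO is a reward-based alignment process, so the exact loss, the `beta` value, and all names below are illustrative assumptions rather than the authors' implementation.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(logp_human, logp_machine,
                    ref_logp_human, ref_logp_machine, beta=0.1):
    """DPO-style loss for one (human-written, machine-revised) pair.

    logp_* are sequence log-likelihoods under the model being trained;
    ref_logp_* are the same quantities under a frozen reference copy.
    The human text plays the role of the preferred sample (assumption).
    """
    margin = beta * ((logp_human - ref_logp_human)
                     - (logp_machine - ref_logp_machine))
    # -log(sigmoid(margin)) written as softplus(-margin) for stability.
    return softplus(-margin)


# Made-up log-likelihoods: the trained model has shifted toward the human text.
print(preference_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss equals log 2 when the trained model and the reference agree, and it shrinks as the trained model raises the likelihood of the human-written text relative to the machine-revised one.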

Paper Structure

This paper contains 28 sections, 11 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Comparison of detection methods across different scenarios. Detection accuracy of Fast-DetectGPT, ImBD, GPTZero, and HLPD on machine-generated text (Generate), three revision tasks (Polish, Expand, Rewrite), and adversarial multi-task revisions. All methods excel when machine-style cues are strong (Generate) but drop sharply on revision tasks, where machine characteristics are attenuated. ImBD amplifies these cues to outperform Fast-DetectGPT, yet still degrades in more general settings, whereas HLPD sustains high accuracy in every scenario.
  • Figure 2: Challenges of Fast-DetectGPT and ImBD under adversarial revision tasks and overview of HLPD. Top row: On the left, logit-based detectors such as Fast-DetectGPT struggle when texts $x_h$ are only revised, because the machine signals are weak. On the right, in ImBD, the scoring model after SPO still favors machine-like signals. This preference can limit generalization, especially when faced with revisions from advanced models under diverse prompts. Bottom row: HLPO forms pairs of human-written and machine-revised texts to train the model $p_\theta$, aligning it directly to human writing style. With the trained $\hat{p}_\theta$, HLPD can reliably detect subtle deviations across various scenarios, including minimal revisions and adversarially generated texts by state-of-the-art LLMs, thereby significantly improving robustness in black-box settings.
  • Figure 3: ROC Curve for Detection.
  • Figure 4: ROC Curve with 95% CI across different datasets.
  • Figure 5: Average AUROC with 95% CI across different datasets.
  • ...and 6 more figures
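
The ROC figures listed above and the AUROC gains quoted in the abstract rest on one metric, so a short sketch may help. AUROC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The relative-improvement helper assumes the common definition (new - base) / base, which this summary does not confirm, and every score below is a made-up toy value.

```python
def auroc(pos_scores, neg_scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


def relative_improvement(new, base):
    """Relative gain of `new` over `base` (assumed definition)."""
    return (new - base) / base


# Toy detector scores, where a higher score means "more likely positive".
positives = [0.9, 0.7, 0.6]
negatives = [0.4, 0.65, 0.2]
print(auroc(positives, negatives))
print(relative_improvement(0.9, 0.75))
```

The pairwise form is quadratic in the number of examples, which is fine for a sketch; a rank-based implementation scales better on large evaluation sets.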