DALD: Improving Logits-based Detector without Logits from Black-box LLMs

Cong Zeng, Shengkun Tang, Xianjun Yang, Yuanzhou Chen, Yiyou Sun, Zhiqiang Xu, Yao Li, Haifeng Chen, Wei Cheng, Dongkuan Xu

TL;DR

Distribution-Aligned LLMs Detection (DALD) is a framework that sets a new state of the art in black-box text detection without requiring logits from the source LLMs.

Abstract

The advent of Large Language Models (LLMs) has revolutionized text generation, producing outputs that closely mimic human writing. This blurring of lines between machine- and human-written text presents new challenges in distinguishing one from the other, a task further complicated by the frequent updates and closed nature of leading proprietary LLMs. Traditional logits-based detection methods leverage surrogate models for identifying LLM-generated content when the exact logits are unavailable from black-box LLMs. However, these methods grapple with the misalignment between the distributions of the surrogate and the often undisclosed target models, leading to performance degradation, particularly with the introduction of new, closed-source models. Furthermore, while current methodologies are generally effective when the source model is identified, they falter in scenarios where the model version remains unknown or the test set comprises outputs from various source models. To address these limitations, we present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection even without logits from source LLMs. DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations with minimal training investment. By leveraging corpus samples from publicly accessible outputs of advanced models such as ChatGPT, GPT-4 and Claude-3, DALD fine-tunes surrogate models to synchronize with unknown source model distributions effectively.
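
To make the notion of a logits-based score concrete, the following is a minimal sketch of a conditional probability curvature in the style of Fast-DetectGPT, assuming a single surrogate model serves as both the sampling and the scoring model. It is an illustration under those assumptions, not the authors' implementation: the function names and the toy logits are hypothetical, and in practice the logits would come from a (possibly fine-tuned) surrogate language model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def conditional_probability_curvature(logits_per_position, token_ids):
    """Normalized gap between the observed log-likelihood and its expectation.

    logits_per_position[j]: surrogate next-token logits at position j (hypothetical input).
    token_ids[j]: index of the token actually observed at position j.
    """
    log_likelihood = 0.0  # sum of log p(observed token) over positions
    mean_total = 0.0      # expected log-probability under the model itself
    var_total = 0.0       # variance of the log-probability under the model
    for logits, tok in zip(logits_per_position, token_ids):
        logp = log_softmax(logits)
        probs = [math.exp(lp) for lp in logp]
        log_likelihood += logp[tok]
        mean_j = sum(p * lp for p, lp in zip(probs, logp))
        second_moment_j = sum(p * lp * lp for p, lp in zip(probs, logp))
        mean_total += mean_j
        var_total += second_moment_j - mean_j ** 2
    if var_total <= 0.0:
        # Degenerate case (e.g. uniform predictions): the score is undefined.
        raise ValueError("zero variance; curvature is undefined")
    return (log_likelihood - mean_total) / math.sqrt(var_total)


# Toy example with a 3-token vocabulary and 2 positions.
toy_logits = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
likely_text = conditional_probability_curvature(toy_logits, [0, 1])
unlikely_text = conditional_probability_curvature(toy_logits, [2, 2])
print(likely_text > unlikely_text)
```

A higher score means the observed tokens are more probable than the model expects on average, which logits-based detectors treat as evidence of machine generation. The quality of that evidence depends on how closely the surrogate's distribution matches the true source model, which is the gap that distribution alignment targets.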

Paper Structure (46 sections, 1 theorem, 18 equations, 6 figures, 14 tables)

This paper contains 46 sections, 1 theorem, 18 equations, 6 figures, 14 tables.

Key Result

Theorem 1

With fine-tuning sample size $K_1 = \Omega(\text{poly}(\Delta / L))$, with probability $1 - \delta$, given a text segment $X$ of length $l$, the conditional probability curvature between the two models is bounded by

Figures (6)

  • Figure 1: The probability curvature distributions of the surrogate model (GPT-2), the target model (Llama-3) and the model after alignment (GPT-2_DALD) on human-written and machine-generated passages from the PubMed dataset.
  • Figure 2: The performance comparison of a static surrogate model on different target models including ChatGPT (GPT-3.5) and GPT-4. The results are based on Fast-DetectGPT with GPT-Neo-2.7B as the surrogate model.
  • Figure 3: An overview of our proposed DALD framework. Our method aligns the distribution of the surrogate model and the target model.
  • Figure 4: The FPR-TPR curves of different methods on the XSum, Writing and PubMed datasets. The results show that our method achieves the highest score at low FPR compared with DNA-GPT and Fast-DetectGPT.
  • Figure 5: AUROC results from our fine-tuned surrogate model with different training dataset sizes.
  • ...and 1 more figure
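
The figure captions report detection quality with FPR-TPR curves and AUROC. As a reminder of what the latter measures, here is a minimal sketch of the rank-based AUROC computation, assuming higher detector scores indicate machine-generated text. The function name and the scores are hypothetical illustration values, not results from the paper.

```python
def auroc(human_scores, machine_scores):
    """Probability that a random machine text outscores a random human text.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))


# Hypothetical detector scores, for illustration only.
print(auroc([0.1, 0.2], [0.8, 0.9]))  # perfectly separated scores give 1.0
```

An AUROC of 0.5 corresponds to a detector that cannot distinguish the two classes, and 1.0 to perfect separation.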

Theorems & Definitions (3)

  • Theorem 1
  • Definition 2
  • Remark 4