IPAD: Inverse Prompt for AI Detection - A Robust and Interpretable LLM-Generated Text Detector

Zheng Chen, Yushi Feng, Jisheng Dang, Yue Deng, Changyang He, Hongxi Pu, Haoxuan Li, Bo Li

TL;DR

IPAD tackles the challenge of reliably detecting AI-generated text by coupling a Prompt Inverter with two interpretable Distinguishers that verify prompt-text alignment and regenerated-text consistency. The three modules are trained in a four-step workflow and fused via a weighted ensemble to produce robust, explainable decisions. Empirical results show state-of-the-art performance on in-distribution, out-of-distribution, and attacked data, and a user study indicates improved interpretability because the decision evidence is visible to users. Limitations include the inverter's incomplete coverage of explicit in-context examples and added compute cost, although IPAD remains more lightweight than many baselines while delivering stronger, explainable detection signals. Overall, IPAD introduces a new paradigm that combines self-consistency checks with interpretable evidence to improve real-world AI text detection.

Abstract

Large Language Models (LLMs) have attained human-level fluency in text generation, which makes it difficult to distinguish human-written from LLM-generated text. This increases the risk of misuse and highlights the need for reliable detectors. Yet existing detectors exhibit poor robustness on out-of-distribution (OOD) data and attacked data, both of which are critical in real-world scenarios. They also struggle to provide interpretable evidence for their decisions, which undermines their reliability. In light of these challenges, we propose IPAD (Inverse Prompt for AI Detection), a novel framework consisting of a Prompt Inverter that predicts prompts that could have generated the input text, and two Distinguishers that estimate the probability that the input text aligns with the predicted prompts. Empirical evaluations demonstrate that IPAD outperforms the strongest baselines by 9.05% (Average Recall) on in-distribution data, 12.93% (AUROC) on out-of-distribution data, and 5.48% (AUROC) on attacked data. IPAD also performs robustly on structured datasets. Furthermore, an interpretability assessment shows that IPAD makes AI detection more trustworthy by allowing users to directly examine the decision-making evidence, which provides interpretable support for its state-of-the-art detection results.
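
The data flow described above (invert the prompt, score the text with two Distinguishers, fuse the scores with a weighted ensemble) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `IPADSketch`, the four callables, the default `weight` and `threshold`, and the choice to regenerate text from the predicted prompt are all assumptions made for the example. The acronyms PTCV and RC are taken from the ablation figure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class IPADSketch:
    """Hypothetical wiring of a prompt inverter and two distinguishers."""

    # All four callables are placeholders for trained models (assumed interfaces).
    invert_prompt: Callable[[str], str]       # input text -> predicted prompt
    regenerate: Callable[[str], str]          # predicted prompt -> regenerated text
    ptcv_score: Callable[[str, str], float]   # (prompt, text) -> P(LLM-generated)
    rc_score: Callable[[str, str], float]     # (text, regenerated) -> P(LLM-generated)
    weight: float = 0.5                       # assumed ensemble weight on the PTCV score
    threshold: float = 0.5                    # assumed decision threshold

    def detect(self, text: str) -> Dict[str, Union[str, float, bool]]:
        # Step 1: predict a prompt that could have produced the input text.
        prompt = self.invert_prompt(text)
        # Step 2: regenerate text from that prompt for the consistency check.
        regenerated = self.regenerate(prompt)
        # Step 3: score with both distinguishers.
        p_ptcv = self.ptcv_score(prompt, text)
        p_rc = self.rc_score(text, regenerated)
        # Step 4: fuse the two scores with a weighted average.
        score = self.weight * p_ptcv + (1.0 - self.weight) * p_rc
        # Return intermediate evidence so a user can inspect the decision.
        return {
            "predicted_prompt": prompt,
            "regenerated_text": regenerated,
            "ptcv": p_ptcv,
            "rc": p_rc,
            "score": score,
            "is_llm_generated": score >= self.threshold,
        }


if __name__ == "__main__":
    # Stub models with fixed outputs, purely to show the call pattern.
    detector = IPADSketch(
        invert_prompt=lambda text: "Write a short note about tea.",
        regenerate=lambda prompt: "Tea is a popular beverage.",
        ptcv_score=lambda prompt, text: 1.0,
        rc_score=lambda text, regenerated: 0.5,
    )
    print(detector.detect("Tea is a widely enjoyed drink."))
```

Returning the predicted prompt and regenerated text alongside the score mirrors the interpretability claim: the evidence behind a decision stays visible rather than being collapsed into a single number.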

Paper Structure

This paper contains 41 sections, 3 equations, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: The overall workflow of our proposed IPAD framework
  • Figure 2: In-distribution performance of IPAD and the baseline detectors. Since r3 reports only AvgRec for the baselines, we also compute AvgRec for IPAD for comparison.
  • Figure 3: Ablation study. Evaluating Fine-tune only on Input, Fine-tune only on Prompt, Prompt Inverter + PTCV, Prompt Inverter + RC, and IPAD on In-distribution datasets, standard OOD datasets, and attacked OOD datasets.
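
The AvgRec metric referenced in Figure 2 is not defined in this excerpt. A minimal sketch follows, assuming it denotes the mean of the recall on human-written text and the recall on LLM-generated text. The function name `average_recall` and the label convention (0 = human, 1 = LLM) are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def average_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall; labels assumed 0 = human, 1 = LLM-generated."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Collect predictions separately for each true class.
    human_preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    machine_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not human_preds or not machine_preds:
        raise ValueError("both classes must be present to compute average recall")

    # Recall for a class = fraction of its samples that were predicted correctly.
    human_recall = sum(1 for p in human_preds if p == 0) / len(human_preds)
    machine_recall = sum(1 for p in machine_preds if p == 1) / len(machine_preds)
    return (human_recall + machine_recall) / 2.0


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 1, 1]
    y_pred = [0, 1, 1, 1, 1, 0]
    # Human recall is 1/2 and machine recall is 3/4 for this toy input.
    print(average_recall(y_true, y_pred))
```

Under this assumed definition the metric weights both classes equally regardless of class imbalance, which is why it is commonly preferred over plain accuracy for detector comparisons.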