Simplex-Optimized Hybrid Ensemble for Large Language Model Text Detection Under Generative Distribution Drift
Sepyan Purnama Kristanto, Lutfi Hakim, Dianni Yusuf
TL;DR
The paper addresses the instability of LLM-generated text detectors under generative distribution drift by introducing a simplex-constrained hybrid ensemble that combines a RoBERTa-based semantic detector, a curvature-based likelihood perturbation score, and a stylometric classifier. The authors formalize risk under generator mixtures and justify convex simplex fusion as a way to reduce worst-case error while remaining lightweight to deploy. On GenDrift-30K, the ensemble achieves 94.2% accuracy and an AUC of 0.978, with fewer false positives on academic text and strong cross-generator generalization, including under paraphrase attacks. The work demonstrates the practical viability of interpretable, modular ensembles for robust AI-text detection in educational and research contexts, and outlines future directions in distillation, dynamic fusion, and multilingual evaluation.
Abstract
The widespread adoption of large language models (LLMs) has made it difficult to distinguish human writing from machine-produced text in many real applications. Detectors that were effective for one generation of models tend to degrade when newer models or modified decoding strategies are introduced. In this work, we study this lack of stability and propose a hybrid ensemble that is explicitly designed to cope with changing generator distributions. The ensemble combines three complementary components: a RoBERTa-based classifier fine-tuned for supervised detection, a curvature-inspired score based on perturbing the input and measuring changes in model likelihood, and a compact stylometric model built on hand-crafted linguistic features. The outputs of these components are fused with weights constrained to the probability simplex, and the weights are chosen via validation-based search. We frame this approach in terms of variance reduction and risk under mixtures of generators, and show that the simplex constraint provides a simple way to trade off the strengths and weaknesses of each branch. Experiments on a 30,000-document corpus drawn from several LLM families, including models unseen during training and paraphrased attack variants, show that the proposed method achieves 94.2% accuracy and an AUC of 0.978. The ensemble also lowers false positives on scientific articles compared to strong baselines, which is critical in educational and research settings where wrongly flagging human work is costly.
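The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes each of the three branches outputs a probability that a document is machine-generated, and it picks non-negative weights summing to one by exhaustive grid search on a validation set. The function names, the grid step, the accuracy criterion, the 0.5 threshold, and the toy scores are all hypothetical choices made for illustration; the abstract does not specify the authors' actual search procedure or selection metric.

```python
# Illustrative sketch only: simplex-constrained fusion of three detector scores.
# All names, the grid step, and the toy data are assumptions, not the authors' code.

def simplex_grid(step=0.1):
    """Yield weight triples (w1, w2, w3) with w_i >= 0 and w1 + w2 + w3 = 1."""
    n = round(1 / step)
    for i in range(n + 1):
        for j in range(n + 1 - i):
            k = n - i - j  # the third weight is fixed by the sum-to-one constraint
            yield (i / n, j / n, k / n)

def fuse(weights, scores):
    """Convex combination of the per-branch probabilities for one document."""
    return sum(w * s for w, s in zip(weights, scores))

def accuracy(weights, score_rows, labels, threshold=0.5):
    """Fraction of documents whose thresholded fused score matches the label."""
    correct = sum(
        int((fuse(weights, row) >= threshold) == bool(y))
        for row, y in zip(score_rows, labels)
    )
    return correct / len(labels)

def search_weights(score_rows, labels, step=0.1):
    """Return the grid point on the simplex with the highest validation accuracy."""
    return max(simplex_grid(step), key=lambda w: accuracy(w, score_rows, labels))

# Toy validation set (made up). Columns: (semantic, curvature, stylometric).
# Label 1 = machine-generated, 0 = human-written.
val_scores = [
    (0.90, 0.60, 0.70),
    (0.80, 0.40, 0.60),
    (0.40, 0.80, 0.70),  # semantic branch misses this one
    (0.20, 0.30, 0.40),
    (0.60, 0.20, 0.30),  # semantic branch would false-positive here
    (0.10, 0.45, 0.20),
]
val_labels = [1, 1, 1, 0, 0, 0]

best_weights = search_weights(val_scores, val_labels)
best_accuracy = accuracy(best_weights, val_scores, val_labels)
print("best weights:", best_weights, "validation accuracy:", best_accuracy)
```

Because the grid contains the three vertices of the simplex, the selected weights can never score worse on the validation set than the best single branch, which is one simple way to see why the constraint is a safe default. A finer step, or AUC in place of accuracy, would slot into the same structure.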
