Detecting LLM-Assisted Academic Dishonesty using Keystroke Dynamics
Atharva Mehta, Rajesh Kumar, Aman Singla, Kartik Bisht, Yaman Kumar Singla, Rajiv Ratn Shah
TL;DR
This work demonstrates that keystroke dynamics, a behavioral signal, can robustly differentiate human-authored from AI-assisted writing when used alongside text features. After expanding their dataset and introducing a paraphrasing mode, the authors benchmark gradient-boosted detectors (LightGBM, CatBoost) and a Siamese-TypeNet model: the machine learning models achieve F1 scores above 97% in structured settings, while TypeNet performs best at detecting paraphrasing, with an F1 score of 86.9%. A deception-based threat model shows that forged keystrokes can degrade performance, which the authors counter with adversarial training. In comparisons with DetectGPT, LLaMA 3.3 70B Instruct, and 44 human evaluators, purely text-based detection and subjective judgment perform near chance, highlighting the value of process-level data for academic integrity assessment. The study suggests a practical, multimodal approach to deterring GenAI-assisted cheating while acknowledging limitations and ethical considerations; future work focuses on larger, more diverse datasets and additional behavioral cues.
Abstract
The rapid adoption of generative AI tools has intensified the challenge of maintaining academic integrity. Conventional plagiarism detectors, which rely on text-matching or text-intrinsic features, often fail to identify submissions that have been AI-assisted or paraphrased. To address this limitation, we introduce keystroke-dynamics-based detectors that analyze how, rather than what, a person writes to distinguish genuine from assisted writing. Building on our earlier study, which collected keystroke data from 40 participants and trained a modified TypeNet model to detect assisted text, we expanded the dataset by adding 90 new participants and introducing a paraphrasing-based plagiarism-detection mode. We then benchmarked two additional gradient-boosting classifiers, LightGBM and CatBoost, alongside TypeNet, and compared their performance with that of DetectGPT, LLaMA 3.3 70B Instruct, and 44 human evaluators. To further assess and improve robustness, we proposed a deception-based threat model that simulates forged keystrokes and applied adversarial training as a countermeasure. Results show that the machine learning models achieve F1 scores above 97% in structured settings, while TypeNet performs best in detecting paraphrasing, with an F1 score of 86.9%. In contrast, text-only detectors and human evaluators perform near chance. These findings demonstrate that keystroke dynamics provide a strong behavioral signal for identifying AI-assisted plagiarism and support the use of multimodal behavioral features for reliable academic integrity assessment.
