Taxonomy of Mathematical Plagiarism
Ankit Satpute, Andre Greiner-Petter, Noah Gießing, Isabel Beckenbach, Moritz Schubotz, Olaf Teschke, Akiko Aizawa, Bela Gipp
TL;DR
The paper addresses semantic-level mathematical plagiarism, which text-based detectors miss because the content is heavily symbolic. It builds a taxonomy of mathematical content similarity by annotating 122 real-world document pairs from zbMATH Open, and defines seven obfuscation operators (P, ID, S, TMMT, DP, FM, VS) that describe how reused content is transformed. Four detectors are evaluated on a large arXiv LaTeX test collection; most cases of human-modified math content remain undetected, and the best methods reach PlagDet scores of only $0.06$ (plagiarism detection) and $0.16$ (math content similarity). The dataset and code are publicly released to support the development of more robust math-content detection methods, with applications in plagiarism prevention, recommender systems, question answering, and search engines.
Abstract
Plagiarism is a pressing concern, even more so with the availability of large language models. Existing plagiarism detection systems reliably find copied and moderately reworded text but fail for idea plagiarism, especially in the mathematical sciences, which rely heavily on formal mathematical notation. We make two contributions. First, we establish a taxonomy of mathematical content reuse by annotating 122 potentially plagiarised scientific document pairs. Second, we analyze the best-performing approaches to detecting plagiarism and mathematical content similarity on the newly established taxonomy. We found that the best-performing methods for plagiarism and math content similarity achieve overall detection scores (PlagDet) of 0.06 and 0.16, respectively. The best-performing methods failed to detect most cases from all seven newly established math similarity types. The outlined contributions will benefit research in plagiarism detection systems, recommender systems, question-answering systems, and search engines. We make our experiment's code and annotated dataset available to the community: https://github.com/gipplab/Taxonomy-of-Mathematical-Plagiarism
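To give a feel for how low the reported PlagDet values are, the sketch below computes the score from precision, recall, and granularity. It assumes the widely used PAN-style definition, PlagDet = F1 / log2(1 + granularity). The abstract does not spell out the formula, so treat this as an assumption rather than the authors' exact implementation. The function name `plagdet` and all input values are illustrative and are not taken from the paper.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Combine precision, recall, and granularity into a single score.

    Assumed definition (PAN-style): F1 / log2(1 + granularity).
    Granularity is >= 1; it equals 1 when each true case is reported
    as exactly one detection, and grows when cases are fragmented.
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    if precision + recall == 0:
        return 0.0  # no correct detections at all
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Hypothetical inputs, chosen only to show how the score behaves.
unfragmented = plagdet(precision=0.5, recall=0.5, granularity=1.0)
fragmented = plagdet(precision=0.5, recall=0.5, granularity=3.0)
print(unfragmented, fragmented)
```

Under this definition the score lies between 0 and 1, so a detector needs both high precision and high recall, with unfragmented detections, to score well; values such as 0.06 and 0.16 sit close to the bottom of that range.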
