Using General Large Language Models to Classify Mathematical Documents
Patrick D. F. Ion, Stephen M. Watt
TL;DR
This study investigates whether general large language models can classify mathematical documents according to the MSC2020 scheme using only titles and abstracts. By prompting a public LLM on a curated set of recent arXiv items and comparing outputs to arXiv/author MSC labels, the authors find that approximately 60% of primary classifications match, while about 40% differ; many discrepancies reveal either gaps in arXiv labeling or superior LLM-derived classifications. The work maps where LLMs align with or diverge from ground truth, documents biases from small sample size and input text quality, and demonstrates that LLMs can yield plausible, sometimes better, classifications in mathematics. The paper argues that these findings justify more extensive, automated studies and tooling to support automated MSC annotation, with attention to robustness, formula handling, and potential web-service deployment. Overall, the results suggest that LLMs are a promising component for automated mathematics literature classification, warranting further development and larger-scale evaluation, while also highlighting hallucination risks that must be mitigated.
Abstract
In this article we report on an initial exploration to assess the viability of using general large language models (LLMs), recently made public, to classify mathematical documents. Automated classification would be useful both from the applied perspective of improving navigation of the literature and for the more open-ended goal of identifying relations among mathematical results. The Mathematics Subject Classification (MSC 2020), from MathSciNet and zbMATH, is widely used, and there is a significant corpus of ground truth material in the open literature. We have evaluated the classification of preprint articles from arXiv.org according to MSC 2020. The experiment used only the title and abstract, not the entire paper. Since this was early in the use of chatbots and the development of their APIs, we report here on what was carried out by hand. Of course, automation of the process will have to follow if it is to be generally useful. We found that in about 60% of our sample the LLM produced a primary classification matching one already reported on arXiv. In about half of those instances, there were additional primary classifications that were not detected. In about 40% of our sample, the LLM suggested a classification different from the one provided. A detailed examination of these cases, however, showed that the LLM-suggested classifications were in most cases better than those provided.
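The comparison described above was carried out by hand, but the same tally could be automated. The following is a minimal sketch, not the authors' procedure: the record layout, the names `Item`, `primary_matches`, and `match_rate`, and the choice of comparing codes at a given prefix length are all assumptions made for illustration, and the four toy records are invented and are not data from the study.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Item:
    """One preprint record (hypothetical layout, not from the paper)."""
    item_id: str                 # any identifier for the preprint
    listed_primary: list[str]    # primary MSC codes reported on arXiv
    predicted_primary: str       # primary MSC code suggested by the LLM


def prefix(code: str, level: int) -> str:
    """Truncate an MSC code: level 2 is the top-level area (e.g. '11'),
    level 3 adds the letter (e.g. '11M'), level 5 is the full code."""
    return code.strip()[:level]


def primary_matches(item: Item, level: int = 2) -> bool:
    """True if the predicted primary code agrees with any listed primary
    code when both are truncated to `level` characters."""
    listed = {prefix(c, level) for c in item.listed_primary}
    return prefix(item.predicted_primary, level) in listed


def match_rate(items: list[Item], level: int = 2) -> float:
    """Fraction of items whose predicted primary matches a listed one."""
    if not items:
        raise ValueError("no items to evaluate")
    return sum(primary_matches(it, level) for it in items) / len(items)


# Invented toy records, for illustration only.
toy = [
    Item("toy-1", ["11M26"], "11M06"),
    Item("toy-2", ["68T50", "00A35"], "00A35"),
    Item("toy-3", ["35Q30"], "76D05"),
    Item("toy-4", ["05C15"], "05C15"),
]

if __name__ == "__main__":
    print("top-level agreement:", match_rate(toy, level=2))
    print("full-code agreement:", match_rate(toy, level=5))
```

The abstract does not state at which level of the MSC hierarchy a match was counted, so the sketch exposes the level as a parameter. Agreement at the two-digit level will generally be higher than agreement on the full five-character code.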
