Text mining arXiv: a look through quantitative finance papers
Michele Leonardo Bianchi
TL;DR
The paper maps how topics evolve in quantitative finance and identifies the field's most influential researchers and journals, using arXiv as the data source. It builds a large full-text corpus (≈16k papers from 1997–2022), preprocesses it, and compares several topic-modeling approaches (Doc2Vec, LDA, Word2Vec, Top2Vec, BERTopic) to discover topics and trends; it also extracts author and journal signals. Doc2Vec with K-means on cleaned data performs best, yielding 30 topics with representative documents and revealing growth in areas such as decentralized finance and AI-driven stock prediction. The paper also presents a curated list of prolific journals and authors, with caveats about arXiv bias. The study provides an open, replicable pipeline for text-based bibliometric analysis that can guide researchers and future bibliometric studies. It acknowledges the limitations of the data source and suggests future work on network analyses and broader data integration.
Abstract
This paper explores articles hosted on the arXiv preprint server with the aim of uncovering valuable insights hidden in this vast collection of research. Using text mining and natural language processing methods, we examine the contents of quantitative finance papers posted on arXiv from 1997 to 2022. We extract and analyze crucial information from the full documents, including the references, to understand how topics trend over time and to identify the most cited researchers and journals in this domain. Additionally, we compare numerous topic-modeling algorithms, including state-of-the-art approaches.
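To make the idea of topic trends over time concrete, here is a minimal sketch, assuming each paper has already been assigned a single topic label and a posting year. It computes the share of papers per topic in each year. The function name `topic_shares_by_year` and the toy data are hypothetical illustrations, not the paper's pipeline or results.

```python
from collections import Counter, defaultdict


def topic_shares_by_year(assignments):
    """Return {year: {topic: share}} from an iterable of (year, topic) pairs.

    Hypothetical helper: assumes one topic label per paper.
    """
    counts = defaultdict(Counter)
    for year, topic in assignments:
        counts[year][topic] += 1

    shares = {}
    for year, topic_counts in sorted(counts.items()):
        total = sum(topic_counts.values())
        # Normalize counts so that shares within a year sum to 1.
        shares[year] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Toy, made-up assignments for illustration only (not data from the paper).
toy = [
    (2019, "option pricing"),
    (2019, "option pricing"),
    (2020, "option pricing"),
    (2020, "decentralized finance"),
    (2021, "option pricing"),
    (2021, "decentralized finance"),
    (2021, "decentralized finance"),
    (2021, "decentralized finance"),
]

shares = topic_shares_by_year(toy)
for year, by_topic in shares.items():
    print(year, by_topic)
```

Normalizing by the yearly total rather than using raw counts keeps a topic's apparent growth from merely reflecting growth in the overall number of papers posted each year.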
