Text mining arXiv: a look through quantitative finance papers

Michele Leonardo Bianchi

TL;DR

The paper maps topic evolution and identifies influential researchers and journals in quantitative finance using arXiv. It builds a large full-text corpus (≈16k papers from 1997–2022), applies preprocessing, compares several topic-modeling approaches (Doc2Vec, LDA, Word2Vec, Top2Vec, BERTopic) to discover topics and trends, and extracts author and journal signals. Doc2Vec with K-means on cleaned data emerges as the strongest performer, yielding 30 topics with representative documents and revealing growth in areas such as decentralized finance and AI-driven stock prediction. The paper also lists the most cited journals and authors, with caveats about arXiv bias. The study provides an open, replicable pipeline for text-based bibliometric analysis, acknowledges the limitations of its data source, and suggests future work on network analyses and broader data integration.

Abstract

This paper explores articles hosted on the arXiv preprint server with the aim of uncovering valuable insights hidden in this vast collection of research. Using text mining and natural language processing methods, we examine the contents of quantitative finance papers posted on arXiv from 1997 to 2022. We extract and analyze crucial information from the entire documents, including the references, to understand topic trends over time and to identify the most cited researchers and journals in this domain. Additionally, we compare numerous topic-modeling algorithms, including state-of-the-art approaches.
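
The TL;DR credits Doc2Vec embeddings clustered with K-means as the best-performing setup. As a rough illustration of the clustering step only, the following is a minimal pure-Python sketch of K-means (Lloyd's algorithm) on toy two-dimensional vectors. The vectors, the function name `kmeans`, and the farthest-point initialisation are illustrative assumptions and not the paper's pipeline, which would operate on document embeddings and 30 clusters.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    # Deterministic farthest-point initialisation (an illustrative choice,
    # not necessarily what the paper uses).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(math.dist(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy stand-ins for document embeddings (hypothetical data).
doc_vectors = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]
labels, centroids = kmeans(doc_vectors, k=2)
print(labels)
```

In a real setting the input would be high-dimensional document vectors, and a library implementation with multiple random restarts would usually be preferable to this hand-rolled loop.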

Paper Structure (8 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Categories by year.
  • Figure 2: Flesch reading ease score. The vertical line represents the median, the dashed lines the quantiles of level 0.25 and 0.75, and the dotted lines the quantiles of level 0.005 and 0.995.
  • Figure 3: Frequent words of the corpus and their percentage of appearance.
  • Figure 4: Word clouds of bigrams and of tri- and four-grams, based on frequency, with parameters min count equal to 250 and threshold equal to 10.
  • Figure 5: Paper length, in number of words, for raw, lemmatized and cleaned data. The x-axis values are in thousands, and the x-axis scale varies across the three datasets. The vertical line represents the median, the dashed lines the quantiles of level 0.25 and 0.75, and the dotted lines the quantiles of level 0.01 and 0.99.
  • ...and 2 more figures
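
Figure 2 reports the Flesch reading ease score. For reference, the standard formula is 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word), where higher values indicate easier text. A minimal sketch follows. The syllable counter is a crude vowel-group heuristic chosen for illustration, and the function names are hypothetical; the paper may well rely on a different tokenizer or an existing library.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch reading ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = flesch_reading_ease("The cat sat on the mat.")
hard = flesch_reading_ease(
    "Stochastic volatility models necessitate sophisticated calibration."
)
print(round(easy, 3), round(hard, 3))
```

Short sentences of one-syllable words score high, while dense technical prose scores low or even negative, which is why the score distribution is a useful quick check on a corpus of research papers.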