Recent Advances in Text Analysis

Zheng Tracy Ke, Pengsheng Ji, Jiashun Jin, Wanshan Li

TL;DR

The paper surveys topic modeling and neural NLP with a focus on statistical transparency and efficiency, then applies Topic-SCORE and TR-SCORE to the MADStat dataset to extract 11 representative statistics topics and to map cross-topic knowledge flows. It introduces the Hofmann-Stigler joint model for abstracts and citations and demonstrates a data-driven, cross-topic knowledge graph that reveals dissemination patterns across statistics research. Key contributions include a fast, theoretically grounded Topic-SCORE method, an extension to rank topics via TR-SCORE, a data-rich MADStat resource, and insightful findings on journal influence and topic evolution from 1975 to 2015. The work has practical implications for research planning, performance assessment, and understanding how statistical ideas propagate across topics and journals.

Abstract

Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze MADStat, a dataset on statistical publications that we collected and cleaned. Applying Topic-SCORE and other methods to MADStat leads to interesting findings. For example, 11 representative topics in statistics are identified. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. In particular, we propose a new statistical model for ranking the citation impacts of the 11 topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of statistical research from 1975 to 2015, from a text analysis perspective.

Paper Structure (42 sections, 10 equations, 13 figures, 10 tables)

Figures (13)

  • Figure 1: Left: total numbers of papers and active authors in each year. Middle: average number of papers per author in each year. Right: average number of authors per paper in each year.
  • Figure 2: Yearly citation curves for 4 papers. Left to right: "sleeping beauty" (Tibshirani (1996) on Lasso), "transient", "steadily increasing" (Dempster, Laird and Rubin (1977) on EM algorithm), and "sudden fame" (Liang and Zeger (1986) on GLM).
  • Figure 3: Journal ranking. Each point is a journal (x-axis: ranking by PageRank, y-axis: ranking by Stigler's model). See the journal table (label tab:journal) in the supplement for the full journal names.
  • Figure 4: For $1 \leq k \leq K$ (where $K = 11$), Panel $k$ is the barplot of the 20 words $j$ that have the largest weight $a_j(k)$ among all words (the length of each bar is the value of $a_j(k)$).
  • Figure 5: The overall topic interests of some authors. For ease of interpretation, we select some authors we are familiar with, but similar figures can be generated for other authors.
  • ...and 8 more figures
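
The word selection described in the Figure 4 caption (for each topic $k$, the 20 words $j$ with the largest weight $a_j(k)$) can be sketched as follows. This is a minimal illustration, not the authors' code: the helper `top_words`, the toy matrix `A`, and the vocabulary are all hypothetical, and topics are indexed from 0 here rather than from 1 as in the caption.

```python
# Hypothetical helper, not from the paper. A[j][k] plays the role of a_j(k),
# the weight of word j in topic k, and vocab[j] is the j-th word.
def top_words(A, vocab, k, n=20):
    """Return the n (word, weight) pairs with the largest a_j(k) for topic k."""
    pairs = [(vocab[j], A[j][k]) for j in range(len(vocab))]
    # Sort by weight, largest first; the sort is stable, so ties keep
    # their vocabulary order.
    pairs.sort(key=lambda p: -p[1])
    return pairs[:n]


# Toy data, made up for illustration: 5 words and K = 2 topics.
vocab = ["prior", "posterior", "lasso", "sparse", "model"]
A = [
    [0.40, 0.02],
    [0.35, 0.03],
    [0.05, 0.45],
    [0.05, 0.40],
    [0.15, 0.10],
]

for k in range(2):
    # Each printed list holds the bar labels and bar lengths for one panel.
    print(k, top_words(A, vocab, k, n=3))
```

With the paper's setting one would call `top_words(A, vocab, k)` for each of the $K = 11$ topics and draw one barplot per returned list, with bar lengths equal to the weights.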