Identifying Emerging Concepts in Large Corpora

Sibo Ma, Julian Nyarko

TL;DR

This work tackles the unsupervised identification of emerging concepts in large text corpora. It introduces a heatmap-based pipeline that maps sentence embeddings to a 2D space, constructs per-period heatmaps, and applies Laplacian-of-Gaussian blob detection to difference heatmaps to locate novel semantic regions. Blobs are tracked over time to form concept trajectories, with parameter settings controlling the balance between sudden, short-lived concepts and longer-lasting ones. The approach outperforms word-level diachronic and clustering baselines on synthetic data and the COHA corpus. Applied to over two million U.S. Senate speeches (1941–2015), it suggests that the minority party introduces new concepts more frequently and that certain concepts correlate with Senators' racial, ethnic, and gender identities. The method provides a scalable, unsupervised tool for digital humanities and social science research, and a public implementation is available.

Abstract

We introduce a new method to identify emerging concepts in large text corpora. By analyzing changes in the heatmaps of the underlying embedding space, we are able to detect these concepts with high accuracy shortly after they originate, in turn outperforming common alternatives. We further demonstrate the utility of our approach by analyzing speeches in the U.S. Senate from 1941 to 2015. Our results suggest that the minority party is more active in introducing new concepts into the Senate discourse. We also identify specific concepts that closely correlate with the Senators' racial, ethnic, and gender identities. An implementation of our method is publicly available.
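
The core detection step summarized above (per-period heatmaps, heatmap subtraction, Laplacian-of-Gaussian blob detection) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the sentence embeddings have already been reduced to 2D points in the unit square, and the names `heatmap`, `detect_new_blobs`, `bins`, `sigma`, and `min_peak` are hypothetical, as are the default values. The toy data are synthetic and stand in for two consecutive time periods.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, maximum_filter


def heatmap(points, bins=64):
    """Normalized 2D histogram of (n, 2) points lying in the unit square."""
    counts, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=bins, range=[[0.0, 1.0], [0.0, 1.0]]
    )
    return counts / max(counts.sum(), 1.0)


def detect_new_blobs(prev_points, curr_points, bins=64, sigma=2.0, min_peak=0.002):
    """Return (x, y) centers of regions that gained density between two periods.

    All parameters are illustrative assumptions, not values from the paper.
    """
    # Difference heatmap: keep only density that is new in the current period.
    diff = np.clip(heatmap(curr_points, bins) - heatmap(prev_points, bins), 0.0, None)
    # Scale-normalized, sign-flipped LoG: bright blobs become positive peaks.
    response = -(sigma ** 2) * gaussian_laplace(diff, sigma=sigma)
    # A blob center is a local maximum whose response exceeds the threshold.
    window = 2 * int(2 * sigma) + 1
    is_peak = (response == maximum_filter(response, size=window)) & (response > min_peak)
    # Convert bin indices back to the coordinates of the bin centers.
    return (np.argwhere(is_peak) + 0.5) / bins


# Toy data: one persistent topic, plus one topic that appears only in period 2.
rng = np.random.default_rng(0)


def sample_topic(center, n):
    """Draw n synthetic 2D 'embeddings' scattered around a topic center."""
    return rng.normal(center, 0.03, size=(n, 2))


period_1 = sample_topic([0.3, 0.3], 3000)
period_2 = np.vstack([sample_topic([0.3, 0.3], 3000), sample_topic([0.7, 0.2], 1500)])

centers = detect_new_blobs(period_1, period_2)
print(centers)  # coordinates of detected new-concept blobs
```

In this sketch, `min_peak` plays a role loosely analogous to a minimum peak-intensity threshold: raising it suppresses weak blobs at the cost of missing small new concepts. Linking blobs across periods into trajectories is not shown here.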

Paper Structure

This paper contains 29 sections, 3 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of our approach for detecting new topics. Texts are embedded and processed into lower-dimensional heatmaps. The heatmap on the left visualizes an example distribution of embeddings after dimensionality reduction. Next, heatmap subtraction removes existing patterns, leaving new regions of high density. The heatmap on the right shows the embeddings after subtraction, with blobs representing new concepts. These blobs are then detected and linked to form cohesive new concepts, which are labeled in the final output (e.g., Burma Campaign, WWII Pacific Battles, Japanese Exclusion Act Protests).
  • Figure 2: Comparison of $F_1$ scores for three clustering algorithms: Our Approach (red), HDBSCAN (blue), and DPMeans (green) as the size of new topics increases.
  • Figure 3: Changes in topic size (number of sentences contained in a topic) over time for Judicial Activism and Marriage Laws. Discussions first emerged in the 1950s and 1960s, with a first major spike in 1989, followed by a series of peaks from 1995 to 2005.
  • Figure 4: Proportion of new partisan concepts introduced by each party. The red line shows, among all Republican speeches, the proportion of speeches discussing newly introduced, partisan concepts (i.e. concepts for which there is an overrepresentation of Republican speeches). The blue line shows the same for Democratic speeches. The shaded areas indicate periods of party majority: red for Republican majority and blue for Democratic majority.
  • Figure 5: The effect of varying $\rho^*$, the threshold controlling the minimum peak intensity for a blob to be identified, on the pipeline's performance for different sizes of new concepts ($n$). The blue and green lines represent Precision and Recall, respectively.
  • ...and 1 more figure