Identifying Emerging Concepts in Large Corpora
Sibo Ma, Julian Nyarko
TL;DR
This work tackles the challenge of identifying emergent concepts in large text corpora without supervision by introducing a heatmap-based pipeline that maps sentence embeddings to a 2D space, constructs per-period heatmaps, and uses difference heatmaps with Laplacian-of-Gaussian blob detection to locate novel semantic regions. Blobs are tracked over time to form concept trajectories, with parameter settings controlling the balance between sudden, short-lived notions and longer-lasting ideas. The approach outperforms word-level diachronic and clustering baselines on synthetic data and the COHA corpus, and is demonstrated on over two million U.S. Senate speeches (1941–2015), revealing that the minority party more frequently introduces new concepts and that certain concepts align with identity groups. The method provides a scalable, unsupervised tool for digital humanities and social science research, and a public implementation is available for further exploration.
Abstract
We introduce a new method to identify emerging concepts in large text corpora. By analyzing changes in the heatmaps of the underlying embedding space, we detect these concepts with high accuracy shortly after they originate, outperforming common alternatives. We further demonstrate the utility of our approach by analyzing speeches in the U.S. Senate from 1941 to 2015. Our results suggest that the minority party is more active in introducing new concepts into the Senate discourse. We also identify specific concepts that closely correlate with the Senators' racial, ethnic, and gender identities. An implementation of our method is publicly available.
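The heatmap-differencing step summarized above can be illustrated with a minimal sketch. This is not the authors' public implementation. It assumes the sentence embeddings have already been projected to 2D by some method, and the function names (`heatmap`, `detect_new_blobs`), the grid size, the single fixed `sigma`, and the `threshold` are all hypothetical choices made for illustration. It builds a normalized density heatmap for two periods, keeps only the regions where density increased, and uses a scale-normalized Laplacian-of-Gaussian response to locate blob centers.

```python
import numpy as np
from scipy import ndimage


def heatmap(points, bins=64, extent=(-10.0, 10.0)):
    """Normalized 2D histogram of 2D-projected embeddings for one period."""
    h, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=bins, range=[extent, extent]
    )
    return h / max(h.sum(), 1.0)


def detect_new_blobs(prev_pts, curr_pts, sigma=2.0, threshold=1e-3,
                     bins=64, extent=(-10.0, 10.0)):
    """Return (x, y) centers of regions that gained density, strongest first.

    All parameter defaults are illustrative assumptions, not values
    taken from the paper.
    """
    # Difference heatmap: keep only density gains from prev to curr.
    diff = heatmap(curr_pts, bins, extent) - heatmap(prev_pts, bins, extent)
    diff = np.clip(diff, 0.0, None)

    # Negated, scale-normalized LoG: bright blobs become positive peaks.
    response = -(sigma ** 2) * ndimage.gaussian_laplace(diff, sigma=sigma)

    # A blob center is a local maximum of the response above the threshold.
    window = int(4 * sigma) + 1
    is_peak = ndimage.maximum_filter(response, size=window) == response
    mask = is_peak & (response > threshold)

    # Sort detected peaks by response strength, strongest first.
    peaks = np.argwhere(mask)
    order = np.argsort(-response[mask])
    peaks = peaks[order]

    # Convert bin indices back to coordinates in the 2D embedding space.
    lo, hi = extent
    width = (hi - lo) / bins
    return lo + (peaks + 0.5) * width
```

A fuller pipeline would presumably scan several values of `sigma` to capture blobs of different sizes and then link blobs across consecutive periods into trajectories. Both steps are omitted here.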
