From Noise to Signal: When Outliers Seed New Topics

Evangelia Zve, Gauvain Bourgne, Benjamin Icard, Jean-Gabriel Ganascia

Abstract

Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.

Paper Structure (21 sections, 4 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Decision tree mapping conditions on $(T_T, T_A, T_I, \theta_{\text{delay}})$ to the seven cases. Left branch splits outlier integrations $(\mathcal{TOA}_{\text{first}}, \mathcal{TOA}_{\text{late}}, \mathcal{TOD}_{\text{late}})$ from non-outlier integrations $(\mathcal{T}_{\text{first}}, \mathcal{T}_{\text{late}})$; right branch yields outliers $(\mathcal{O}_{\text{recent}}, \mathcal{O}_{\text{old}})$ persisting until $T_{\text{final}}$. ✓ / ✗ indicate branch outcomes.
  • Figure 2: Overview of the seven taxonomy cases. Panels (a–g) show temporal relations between $T_A$, $T_T$, $T_I$, $\theta_{\text{delay}}$, and $T_{\text{final}}$. Colored belts indicate topic activity and outlier phases.
  • Figure 3: Temporal distribution of HydroNewsFr over the collection period (20 March–8 June 2025). Bars show daily publication counts; the stepped line and hatched area show cumulative totals.
  • Figure 4: Cumulative clustering over the first 9 time windows with mistral-embed and 2D UMAP. Colors indicate topics; newly assigned documents are larger and more opaque; black $\times$ denote outliers.
  • Figure 5: Empirical survival curve $S(t)=P(\Delta T>t)$ for integration delays, pooled across models and configurations. Dashed lines mark quantiles; the 90th percentile ($p_{90}=26$ days) defines $\theta_{\text{delay}}$.
  • ...and 3 more figures
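
The Figure 5 caption defines an empirical survival curve $S(t)=P(\Delta T>t)$ over integration delays and takes its 90th percentile as $\theta_{\text{delay}}$. A minimal sketch of that computation follows. The function names and the delay values are hypothetical and chosen only for illustration; they are not the paper's data, and the pooling across models and configurations is not reproduced here.

```python
import numpy as np


def empirical_survival(delays, t):
    """Empirical S(t) = P(delta_T > t): share of delays strictly above t."""
    delays = np.asarray(delays, dtype=float)
    return float(np.mean(delays > t))


def delay_threshold(delays, q=90):
    """q-th percentile of the delays (NumPy's default linear interpolation)."""
    return float(np.percentile(np.asarray(delays, dtype=float), q))


# Hypothetical integration delays in days (illustrative, not from the paper).
example_delays = [1, 2, 3, 5, 8, 13, 21, 26, 30, 40]

theta_delay = delay_threshold(example_delays)
tail_share = empirical_survival(example_delays, theta_delay)
print(f"theta_delay = {theta_delay:.1f} days, S(theta_delay) = {tail_share:.2f}")
```

By construction, roughly 10% of observed delays exceed a threshold set at the 90th percentile, which is why such a cutoff can separate typical integration delays from unusually late ones.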