EvoTaxo: Building and Evolving Taxonomy from Social Media Streams

Yiyang Li, Tianyi Ma, Yanfang Ye

Abstract

Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.
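
To make the idea of dual-view consolidation concrete, the following is a minimal sketch, not the authors' implementation. It assumes each candidate edit carries an embedding and a timestamp, and it assumes one plausible form for the second view: cosine similarity damped by an exponential kernel on the time gap. The names `CandidateEdit`, `semantic_temporal_sim`, `greedy_clusters`, the decay constant `tau`, and the threshold are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CandidateEdit:
    label: str              # hypothetical description of the proposed edit
    embedding: List[float]  # hypothetical embedding of the edit text
    day: float              # hypothetical timestamp, in days since window start


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_sim(a: CandidateEdit, b: CandidateEdit) -> float:
    # View 1: semantic similarity only.
    return cosine(a.embedding, b.embedding)


def semantic_temporal_sim(a: CandidateEdit, b: CandidateEdit, tau: float = 7.0) -> float:
    # View 2 (assumed form): semantic similarity damped by the time gap.
    return semantic_sim(a, b) * math.exp(-abs(a.day - b.day) / tau)


def greedy_clusters(
    edits: List[CandidateEdit],
    sim: Callable[[CandidateEdit, CandidateEdit], float],
    threshold: float,
) -> List[List[CandidateEdit]]:
    # Simple stand-in for a clustering step: join the first cluster whose
    # first member is similar enough, otherwise open a new cluster.
    clusters: List[List[CandidateEdit]] = []
    for edit in edits:
        for cluster in clusters:
            if sim(edit, cluster[0]) >= threshold:
                cluster.append(edit)
                break
        else:
            clusters.append([edit])
    return clusters


edits = [
    CandidateEdit("add protest subtopic", [1.0, 0.0], day=0.0),
    CandidateEdit("add protest subtopic (a month later)", [1.0, 0.0], day=30.0),
    CandidateEdit("add policy subtopic", [0.0, 1.0], day=1.0),
]
semantic_view = greedy_clusters(edits, semantic_sim, threshold=0.8)
temporal_view = greedy_clusters(edits, semantic_temporal_sim, threshold=0.8)
```

In this toy input, the semantic view merges the two protest edits even though they are a month apart, while the semantic-temporal view keeps them separate. This illustrates why a second, time-aware view can help distinguish a sustained theme from a temporally concentrated burst.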

Paper Structure (35 sections, 5 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: K-Means clustering of raw social media posts and their corresponding LLM-generated actions, where the number of clusters is selected by the highest silhouette score. Compared with raw posts, action representations form substantially clearer cluster structure, motivating action-level consolidation in EvoTaxo.
  • Figure 2: Overview of EvoTaxo. The framework starts from an LLM-generated seed taxonomy with concept memory banks and processes posts chronologically. Each post is mapped to a draft action. Structural actions are accumulated and consolidated at the window boundary under semantic and semantic-temporal views, then refined and arbitrated into final actions. These final actions update both the taxonomy and its temporal grounding records.
  • Figure 3: Monthly evolution of the taxonomy induced from /r/ICE_Raids, aggregated to the five top-level topics. The figure highlights three major turning points: an early concentration on personal-harm narratives, a mid-period protest-driven burst, and a later shift toward institutional oversight and policy discussion. The lower panel shows the number of newly introduced subtopics per month.
  • Figure 4: Taxonomy snapshot for /r/ICE_Raids after the June 2025 update. Green highlights mark subtopics newly added in June, illustrating how EvoTaxo converts a temporally concentrated protest-driven discourse burst into explicit taxonomy growth.
  • Figure 5: Yearly evolution of the taxonomy induced from /r/opiates, aggregated to the seven top-level topics. Early years are dominated by tapering/quitting and street-opiate quality and harm-reduction discussion. Later years show a stronger shift toward medical prescribing and treatment access.
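
The model-selection step mentioned in the Figure 1 caption (K-Means with the number of clusters chosen by the highest silhouette score) can be sketched as follows. This is an illustrative, dependency-free version under stated assumptions, not the authors' code: the farthest-point initialisation, the toy 2-D points standing in for post or action embeddings, and the function names `kmeans`, `silhouette`, and `best_k` are all hypothetical.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, iters=20):
    # Deterministic farthest-point initialisation (an assumption for
    # reproducibility; the paper does not specify the initialisation).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    if len(set(labels)) < 2:
        return -1.0  # silhouette is undefined for a single cluster
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_k(points, candidates):
    # Pick the candidate k whose K-Means labelling has the highest mean silhouette.
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))


# Toy data: three well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
chosen_k = best_k(points, [2, 3, 4])
```

On real data, the same selection would run over embeddings of raw posts and of LLM-generated actions separately, which is the comparison the caption describes.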