Topo Goes Political: TDA-Based Controversy Detection in Imbalanced Reddit Political Data

Arvindh Arun, Karuna K Chandra, Akshit Sinha, Balakumar Velayutham, Jashn Arora, Manish Jain, Ponnurangam Kumaraguru

TL;DR

This work tackles controversy detection in political discourse on Reddit under real-world class imbalance by introducing a dataset focused on Indian politics and evaluating methods beyond synthetic balance. It combines traditional ML, Graph Neural Networks, and topology-inspired features, notably Persistent Homology, with a new Imbalance Impact Score $\mathcal{I}$ to quantify robustness to imbalance. By constructing user interaction graphs and extracting temporal, structural, and topological signals (including Vietoris–Rips filtrations and persistence images via Giotto-TDA), the study demonstrates that topological features can improve robustness in imbalanced settings and highlights the limitations of existing benchmarks. The dataset and code release aim to establish a practical, reproducible benchmark for controversy detection in real-world, imbalanced political discourse with potential implications for platforms, journalists, and policymakers.

Abstract

The detection of controversial content in political discussions on the Internet is a critical challenge in maintaining healthy digital discourse. Unlike much of the existing literature that relies on synthetically balanced data, our work preserves the natural distribution of controversial and non-controversial posts. This real-world imbalance highlights a core challenge that needs to be addressed for practical deployment. Our study re-evaluates well-established methods for detecting controversial content. We curate our own dataset focusing on the Indian political context that preserves the natural distribution of controversial content, with only 12.9% of the posts in our dataset being controversial. This disparity reflects the true imbalance in real-world political discussions and highlights a critical limitation in the existing evaluation methods. Benchmarking on datasets that model data imbalance is vital for ensuring real-world applicability. Thus, in this work, (i) we release our dataset, with an emphasis on class imbalance, that focuses on the Indian political context, (ii) we evaluate existing methods from this domain on this dataset and demonstrate their limitations in the imbalanced setting, (iii) we introduce an intuitive metric to measure a model's robustness to class imbalance, (iv) we also incorporate ideas from the domain of Topological Data Analysis, specifically Persistent Homology, to curate features that provide richer representations of the data. Furthermore, we benchmark models trained with topological features against established baselines.
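To make the imbalance concern concrete, the following minimal sketch works through the arithmetic for a hypothetical sample of 1,000 posts at the reported 12.9% controversial rate. The sample size, the trivial "always non-controversial" predictor, and the helper name `binary_metrics` are illustrative assumptions, not part of the paper, and this is not the authors' robustness metric. It only shows why accuracy alone can look strong while the minority class is missed entirely.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical sample: 129 controversial posts out of 1,000 (12.9%).
y_true = [1] * 129 + [0] * 871

# Trivial baseline (assumption for illustration): never flag a post.
y_pred = [0] * 1000

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Under these assumptions the baseline reaches 87.1% accuracy while its recall and F1 on the controversial class are both zero, which is the kind of gap that evaluation on synthetically balanced data would hide.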

Paper Structure

This paper contains 26 sections, 2 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Taxonomy of the diverse features extracted and the methods used.
  • Figure 2: This density plot illustrates the distribution of $\mathrm{UR}$ for both controversial and non-controversial posts in our pilot study (A) and in our dataset (B). The plots reveal a region of separability between the two classes indicated by the vertical lines, which is used to derive a threshold for categorizing posts as controversial or non-controversial.
  • Figure 3: WARNING: The following figure contains potentially offensive language. The Post-Comment Tree of a controversial post (#17i72r4) reveals deep branching and multiple levels of user interaction, highlighting the complexity and depth of engagement.
  • Figure 4: Subfigure (a) shows $G$ of a controversial post (#17i72r4) revealing patterns of cyclic interactions, indicated by motifs in (b) where groups of users repeatedly interact with each other, often with contradicting viewpoints. In (b), the motifs highlighted in orange and blue correspond to the motif types (11) and (12) described in Figure 5.
  • Figure 5: The 13 possible 3-motifs we count where each motif represents a different pattern of interactions among three users in a discussion. Counting these motifs provides insights into the interaction dynamics, such as agreement, disagreement, and the formation of echo chambers within the conversation.
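The motif counting described in Figure 5 can be sketched as follows. This is a brute-force illustration, not the authors' implementation: it assumes the user interaction graph is given as a list of directed `(source, target)` edges, and the function names and the integer codes used to identify motif types are hypothetical and do not correspond to the numbering (1) to (13) in the figure. Each weakly connected triple of users is mapped to a canonical code, so isomorphic interaction patterns are counted together.

```python
from itertools import combinations, permutations


def triad_code(edges, nodes):
    """Canonical integer code of the directed subgraph induced on three nodes.

    The code is the minimum 6-bit adjacency encoding over all orderings of
    the three nodes, so isomorphic triads receive the same code.
    """
    best = None
    for perm in permutations(nodes):
        bits = 0
        for i in range(3):
            for j in range(3):
                if i != j:
                    bits = (bits << 1) | int((perm[i], perm[j]) in edges)
        if best is None or bits < best:
            best = bits
    return best


def is_weakly_connected(edges, nodes):
    """Three nodes are weakly connected if at least two pairs are linked."""
    a, b, c = nodes

    def linked(u, v):
        return (u, v) in edges or (v, u) in edges

    return linked(a, b) + linked(b, c) + linked(a, c) >= 2


def count_3_motifs(edge_list):
    """Count connected 3-node motifs by canonical code (cubic in node count)."""
    edges = {(u, v) for u, v in edge_list if u != v}  # drop self-loops
    nodes = sorted({n for e in edges for n in e})
    counts = {}
    for triple in combinations(nodes, 3):
        if is_weakly_connected(edges, triple):
            code = triad_code(edges, triple)
            counts[code] = counts.get(code, 0) + 1
    return counts


def all_connected_motif_codes():
    """Enumerate every connected directed pattern on three nodes."""
    pairs = [(i, j) for i in range(3) for j in range(3) if i != j]
    codes = set()
    for mask in range(1 << len(pairs)):
        edges = {p for k, p in enumerate(pairs) if (mask >> k) & 1}
        if is_weakly_connected(edges, (0, 1, 2)):
            codes.add(triad_code(edges, (0, 1, 2)))
    return codes


# Toy interaction graph (hypothetical users): a reply cycle plus one extra reply.
toy_edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u1"), ("u3", "u4")]
motif_counts = count_3_motifs(toy_edges)
print(motif_counts)
print(len(all_connected_motif_codes()))  # number of distinct connected 3-motifs
```

Enumerating all edge configurations on three nodes recovers the 13 connected motif types mentioned in the caption. For larger graphs, a dedicated routine such as a triadic census would be a more efficient choice than this cubic-time loop.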