MGM: Global Understanding of Audience Overlap Graphs for Predicting the Factuality and the Bias of News Media

Muhammad Arslan Manzoor, Ruihong Zeng, Dilshod Azizov, Preslav Nakov, Shangsong Liang

TL;DR

MGM addresses the challenge of profiling news media by factuality and political bias in graph-rich environments where edges encode audience overlap and labels are scarce. It extends GNNs with a variational EM framework that leverages globally similar nodes stored in an external memory, selecting a sparse set of candidate nodes via a Dirichlet prior, and combines local and global information through a flexible mix parameter. The framework also integrates with pre-trained language models, boosting performance when textual data for some outlets is missing, and achieves new state-of-the-art results on MBFC-derived benchmarks. Empirically, MGM improves several base GNNs, demonstrates robustness to memory configurations, and delivers substantial gains when fused with PLMs, highlighting its practical impact for scalable media profiling and misinformation mitigation.
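The idea of blending a node's local prediction with information from globally similar nodes can be illustrated with a small sketch. This is not the paper's variational EM procedure: the cosine similarity, the softmax weighting over the top-`k` memory entries, the convex combination, and the function name `mix_local_global` are all assumptions made for illustration. The parameter `eta` is assumed to weight the global term, and `k` stands in for the number of candidate nodes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mix_local_global(local_probs, query_emb, memory_embs, memory_probs, k, eta):
    """Blend a node's local class probabilities with those of its k most
    similar memory entries. Hypothetical helper, not the authors' code."""
    # Rank every memory entry by similarity to the query node.
    sims = [cosine(query_emb, m) for m in memory_embs]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    if not top:
        return list(local_probs)  # no candidates: fall back to the local view

    # Softmax over the top-k similarities gives normalized weights (assumption).
    mx = max(sims[i] for i in top)
    weights = [math.exp(sims[i] - mx) for i in top]
    z = sum(weights)

    # Weighted average of the candidates' class probabilities.
    n_classes = len(local_probs)
    global_probs = [
        sum(w / z * memory_probs[i][c] for w, i in zip(weights, top))
        for c in range(n_classes)
    ]

    # Assumed trade-off: eta = 0 keeps the local view, eta = 1 uses only the global view.
    return [(1 - eta) * l + eta * g for l, g in zip(local_probs, global_probs)]

local = [0.6, 0.3, 0.1]
memory_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
memory_probs = [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.2, 0.7, 0.1]]
mixed = mix_local_global(local, [1.0, 0.0], memory_embs, memory_probs, k=2, eta=0.5)
print(mixed)
```

In this toy input the two most similar memory entries both favor the second class, so the mixed distribution shifts toward it while still summing to one.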

Abstract

In the current era of rapidly growing digital data, evaluating the political bias and factuality of news outlets has become increasingly important for finding reliable information online. In this work, we study the classification problem of profiling news media through the lens of political bias and factuality. Traditional profiling methods, such as Pre-trained Language Models (PLMs) and Graph Neural Networks (GNNs), have shown promising results, but they face notable challenges. PLMs focus solely on textual features, causing them to overlook the complex relationships between entities, while GNNs often struggle with media graphs containing disconnected components and insufficient labels. To address these limitations, we propose MediaGraphMind (MGM), an effective solution within a variational Expectation-Maximization (EM) framework. Instead of relying on limited neighboring nodes, MGM leverages features, structural patterns, and label information from globally similar nodes. Such a framework not only enables GNNs to capture long-range dependencies for learning expressive node representations but also enhances PLMs by integrating structural information, thereby improving the performance of both models. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves new state-of-the-art results. Further, we share our repository, which contains the dataset, code, and documentation.

Paper Structure

This paper contains 29 sections, 11 equations, 3 figures, 14 tables, 1 algorithm.

Figures (3)

  • Figure 1: Key components of our proposed approach. The sections highlighted with a grey background represent the architectural contributions introduced by our framework. GNNs store the representations of the media graphs in an external global memory ($M_g$). A Dirichlet prior is used to select the distribution of sparse candidate nodes, which are stored in the sampled memory ($M_s$). The parameters $K$ and $\eta$ control the number of candidate nodes and their influence, balancing local and global information. Because PLMs lack representations for some media outlets, they use MGM's representation-based probabilities for the classification task. The detailed pipeline of MGM integration with PLMs is shown in Figure 3 (in the appendix).
  • Figure 2: MGM performance across all GNNs for both tasks, evaluated for different values of $K$ (global similar nodes) and $\eta$ (trade-off hyper-parameter).
  • Figure 3: The pipeline of integrating MGM with PLMs. Stage 1: We use logistic regression (meta-learner) to make predictions on probabilities obtained from PLMs on 472 media sources. For the remaining media sources, we assign $[0.0, 0.0, 0.0]$ probabilities. Stage 2: We use the probabilities produced by PLMs and, for the missing ones, we substitute the probabilities from the best GNN, MGM$_{\text{GATv2}}$. Logistic regression is then used to make the predictions. Stage 3: We concatenate the probabilities of the best PLM on Wikipedia and on Articles and use logistic regression to make predictions. Stage 4: We concatenate the probabilities obtained from Stage 3 with those generated by three GNNs (MGM$_{\text{FiLM}}$, MGM$_{\text{FAGCN}}$, and MGM$_{\text{GATv2}}$) across five different run seeds. Logistic regression is then employed to make predictions, and the scores are reported with their standard deviation.
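The feature construction described for Stages 1 and 2 in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_meta_features`, the dictionary layout keyed by outlet name, and the example outlets are hypothetical. The resulting rows would then be passed to a logistic-regression meta-learner, which is omitted here.

```python
def build_meta_features(outlets, plm_probs, gnn_probs=None, n_classes=3):
    """Build one probability row per outlet for a meta-learner.

    plm_probs / gnn_probs map an outlet name to its class probabilities
    (hypothetical layout). With gnn_probs=None this mimics Stage 1: outlets
    without PLM output get an all-zero row. With gnn_probs supplied it mimics
    Stage 2: missing PLM rows are filled from the GNN instead.
    """
    rows = []
    for outlet in outlets:
        if outlet in plm_probs:
            # PLM output is available: use it as-is.
            rows.append(list(plm_probs[outlet]))
        elif gnn_probs is not None and outlet in gnn_probs:
            # Stage 2: fall back to the graph model's probabilities.
            rows.append(list(gnn_probs[outlet]))
        else:
            # Stage 1 (or no GNN output either): placeholder zeros.
            rows.append([0.0] * n_classes)
    return rows

# Toy data: only the first outlet has text for the PLM.
outlets = ["a.example", "b.example", "c.example"]
plm = {"a.example": [0.7, 0.2, 0.1]}
gnn = {"a.example": [0.3, 0.3, 0.4], "b.example": [0.1, 0.1, 0.8]}

stage1 = build_meta_features(outlets, plm)
stage2 = build_meta_features(outlets, plm, gnn)
print(stage1)
print(stage2)
```

Note that the PLM row takes precedence whenever it exists, so the GNN only fills gaps. This matches the caption's description of Stage 2; how ties or partial outputs are handled in the actual pipeline is not stated in the source.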