Intertwined Biases Across Social Media Spheres: Unpacking Correlations in Media Bias Dimensions

Yifan Liu, Yike Li, Dong Wang

TL;DR

The paper addresses the fragmentation caused by single-dimension media bias benchmarks by introducing a cross-platform, multi-domain dataset annotated for multiple bias dimensions on YouTube and Reddit collected over five years. It analyzes inter-dimension correlations and temporal dynamics using both manual annotations and automated labeling via shallow models and a large language model, highlighting domain-specific bias expressions and event-driven surges. Key contributions include the first joint labeling of multiple bias dimensions across five domains, a comprehensive correlation and time-series analysis, and recommendations for adaptive multi-task learning to exploit high-bias correlations. The work advances bias identification by providing a resource and insights that support developing more robust, time-aware, and multi-dimensional detection systems, with practical implications for fairer media consumption and ethical journalism.

Abstract

Media bias significantly shapes public perception by reinforcing stereotypes and exacerbating societal divisions. Prior research has often focused on isolated media bias dimensions such as political bias or racial bias, neglecting the complex interrelationships among various bias dimensions across different topic domains. Moreover, we observe that models trained on existing media bias benchmarks fail to generalize effectively on recent social media posts, particularly in certain bias identification tasks. This shortfall primarily arises because these benchmarks do not adequately reflect the rapidly evolving nature of social media content, which is characterized by shifting user behaviors and emerging trends. In response to these limitations, our research introduces a novel dataset collected from YouTube and Reddit over the past five years. Our dataset includes automated annotations for YouTube content across a broad spectrum of bias dimensions, such as gender, racial, and political biases, as well as hate speech, among others. It spans diverse domains including politics, sports, healthcare, education, and entertainment, reflecting the complex interplay of biases across different societal sectors. Through comprehensive statistical analysis, we identify significant differences in bias expression patterns and intra-domain bias correlations across these domains. By utilizing our understanding of the correlations among various bias dimensions, we lay the groundwork for creating advanced systems capable of detecting multiple biases simultaneously. Overall, our dataset advances the field of media bias identification, contributing to the development of tools that promote fairer media consumption. The comprehensive awareness of existing media bias fosters more ethical journalism, promotes cultural sensitivity, and supports a more informed and equitable public discourse.

Paper Structure (23 sections, 3 figures, 4 tables)

This paper contains 23 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Radar plot illustrating the distribution of biased content across various dimensions. Notably, the political domain exhibits significantly higher proportions of various types of bias compared to other domains.
  • Figure 2: Correlation heatmaps between bias dimensions within each domain, calculated using Cramér's $\mathcal{V}$. Higher values indicate stronger correlations. Abbreviations: HS for Hate Speech, PB for Political Bias, GB for Gender Bias, RB for Racial Bias, LB for Linguistic Bias, and TLCB for Text-level Context Bias.
  • Figure 3: Line plots of monthly aggregated counts for each bias dimension. Key observations include: 1) Hate speech manifests in varying proportions of the two types of style-based bias dimensions across different domains. 2) Notable surges in aggressive biases occur in specific months within the politics domain, supporting our hypothesis that biases in this domain are more event-driven than in others.
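
The Figure 2 caption names Cramér's V as the correlation measure between bias dimensions. As a rough illustration of how such a pairwise matrix could be computed, the sketch below implements the standard, uncorrected formula V = sqrt(chi^2 / (n * (min(r, c) - 1))) using only the Python standard library. This is not the authors' code: the function names `cramers_v` and `correlation_matrix`, the binary 0/1 label encoding, and the toy labels are all assumptions made for illustration, and the paper may well apply a bias-corrected variant.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    if len(xs) != len(ys):
        raise ValueError("label sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed counts per (x, y) cell
    row_totals = Counter(xs)      # marginal counts of the first variable
    col_totals = Counter(ys)      # marginal counts of the second variable
    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        # V is undefined when a variable is constant; this sketch reports 0.
        return 0.0
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n  # count expected under independence
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def correlation_matrix(labels):
    """Pairwise Cramér's V for a dict mapping dimension name -> label list."""
    names = list(labels)
    return {a: {b: cramers_v(labels[a], labels[b]) for b in names} for a in names}


# Hypothetical toy labels (1 = biased, 0 = not biased) for three dimensions
# within a single domain; these are NOT taken from the paper's dataset.
toy_labels = {
    "HS": [1, 1, 0, 0, 1, 0, 1, 0],
    "PB": [1, 1, 0, 0, 1, 0, 0, 1],
    "GB": [0, 1, 0, 1, 0, 1, 0, 1],
}
matrix = correlation_matrix(toy_labels)
for name, row in matrix.items():
    print(name, {other: round(v, 3) for other, v in row.items()})
```

Running one such matrix per domain would give the inputs for a per-domain heatmap like the one the caption describes. For binary labels, V reduces to the absolute value of the phi coefficient, so it captures the strength of an association but not its direction.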