Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora

Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen

TL;DR

The paper addresses the problem that large text corpora encode and amplify gender inequalities, which can bias NLP analyses. It introduces an actor-level, discourse-aware pipeline that extends prior work with metrics for syntactic agency, quotation style, sentiment, and pointwise mutual information (PMI), together with structured reports and a two-stage exclusion process for balancing corpora. Applied to the German taz2024full corpus (1980–2024), the approach yields a more gender-balanced dataset while preserving core discourse dynamics: asymmetries in representation and framing are reduced, though subtler biases persist. The work emphasizes transparency and reproducibility through open-source tooling, and it discusses ethical considerations and avenues for future research, including non-binary representation and intersectional fairness.

Abstract

Language corpora are the foundation of most natural language processing research, yet they often reproduce structural inequalities. One such inequality is gender discrimination in how actors are represented, which can distort analyses and perpetuate discriminatory outcomes. This paper introduces a user-centric, actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. By combining discourse-aware analysis with metrics for sentiment, syntactic agency, and quotation styles, our method enables both fine-grained auditing and exclusion-based balancing. Applied to the taz2024full corpus of German newspaper articles (1980-2024), the pipeline yields a more gender-balanced dataset while preserving core dynamics of the source material. Our findings show that structural asymmetries can be reduced through systematic filtering, though subtler biases in sentiment and framing remain. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
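
To illustrate one of the metrics named in the TL;DR, the following is a minimal sketch of pointwise mutual information (PMI) between a gender label and a co-occurring word, using the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function name `pmi_table`, the label codes `"f"` and `"m"`, and the toy observations are assumptions made for illustration only; this page does not show how the paper's pipeline extracts co-occurrences or which PMI variant it uses.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Compute PMI for each observed (gender_label, word) pair.

    `pairs` is an iterable of (gender_label, word) co-occurrence
    observations. Hypothetical helper, not the paper's implementation.
    """
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)
    label_counts = Counter(label for label, _ in pairs)
    word_counts = Counter(word for _, word in pairs)

    scores = {}
    for (label, word), n in joint.items():
        p_joint = n / total
        p_label = label_counts[label] / total
        p_word = word_counts[word] / total
        # Positive PMI: the word co-occurs with this label more often
        # than expected under independence; negative PMI: less often.
        scores[(label, word)] = math.log2(p_joint / (p_label * p_word))
    return scores


# Invented toy observations, not taken from the taz2024full corpus.
observations = [
    ("f", "says"),
    ("f", "mother"),
    ("m", "says"),
    ("m", "says"),
]
scores = pmi_table(observations)
```

In this toy data, `"mother"` appears only with the `"f"` label and so receives a positive score for that pair, while `"says"` is spread across both labels and scores closer to zero. On real data, rare words produce unstable PMI estimates, so a minimum-frequency threshold is a common precaution.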

Paper Structure

This paper contains 21 sections, 3 equations, 10 figures.

Figures (10)

  • Figure 1: Percentage of male- and female-coded references over time before filtering. Fluctuations in the early years reflect the small number of available articles.
  • Figure 2: Distribution of quotation styles by gender before filtering. Early-year fluctuations are attributable to low article counts.
  • Figure 3: Distribution of syntactic roles by gender before filtering. Early-year fluctuations are attributable to low article counts.
  • Figure 4: Distribution of gender ratios across articles before filtering.
  • Figure 5: Proportion of excluded texts per year by flag type. Subject-role asymmetry dominates, while naming, quoting, and sentiment gaps occur less frequently.
  • ...and 5 more figures