Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora

Stefanie Urchs; Veronika Thurner; Matthias Aßenmacher; Christian Heumann; Stephanie Thiemichen

TL;DR

The paper addresses the problem that large text corpora encode and amplify gender inequalities, which can bias NLP analyses. It introduces an actor-level, discourse-aware pipeline that extends prior work with metrics for syntactic agency, quotation style, sentiment, and pointwise mutual information (PMI), plus structured reports and a two-stage exclusion process for balancing corpora. Applied to the German taz2024full corpus (1980–2024), the approach yields a more gender-balanced dataset while preserving core discourse dynamics. It reduces asymmetries in representation and framing, though subtler biases persist. The work emphasizes transparency and reproducibility through open-source tooling and discusses ethical considerations and avenues for future research, including non-binary representation and intersectional fairness.

Abstract

Language corpora are the foundation of most natural language processing research, yet they often reproduce structural inequalities. One such inequality is gender discrimination in how actors are represented, which can distort analyses and perpetuate discriminatory outcomes. This paper introduces a user-centric, actor-level pipeline for detecting and mitigating gender discrimination in large-scale text corpora. By combining discourse-aware analysis with metrics for sentiment, syntactic agency, and quotation styles, our method enables both fine-grained auditing and exclusion-based balancing. Applied to the taz2024full corpus of German newspaper articles (1980-2024), the pipeline yields a more gender-balanced dataset while preserving core dynamics of the source material. Our findings show that structural asymmetries can be reduced through systematic filtering, though subtler biases in sentiment and framing remain. We release the tools and reports to support further research in discourse-based fairness auditing and equitable corpus construction.
