Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis

Valentin Pelloin; Lena Dodson; Émile Chapuis; Nicolas Hervé; David Doukhan

TL;DR

This work addresses the problem of quantifying gender-based topic representation biases in French broadcast news. The authors assemble a large-scale dataset of $11.7k$ hours of transcribed content from 21 channels, label topics with a few-shot Mixtral-8x7B LLM, and develop a scalable Teacher/Student distillation pipeline to train lighter classifiers. Evaluations on 804 manually annotated dialogues show that the approach yields competitive macro- and micro-F1 scores (best Macro-F1 around $58.5\%$ with CamemBERT-base, and Micro-F1 around $62.5\%$), while enabling the processing of over $2$ million dialogues at reduced cost. The study then analyzes gender biases in topic representation, finding that women speak less on sports but more on weather and health, with notable differences between private and public channels; these results demonstrate the framework’s potential for large-scale monitoring of gendered content in media. The work also releases the annotated dataset and discusses limitations, including binary gender labeling, as well as future directions such as improved topic segmentation and broader-scope analyses.
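
To make the two reported metrics concrete, here is a minimal sketch of how Macro-F1 and Micro-F1 differ. It assumes a single-label setting (one topic per dialogue), which the summary above does not confirm. The function name `f1_scores` and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) for single-label predictions.

    Assumption: each dialogue carries exactly one gold topic and one
    predicted topic. This is an illustrative simplification.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted topic gets a false positive
            fn[g] += 1  # gold topic gets a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average of per-topic F1, so every topic weighs the same.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: F1 over pooled counts, so frequent topics dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data: the rare "weather" topic is always missed.
gold = ["sports", "sports", "sports", "sports", "weather", "health"]
pred = ["sports", "sports", "sports", "sports", "sports", "health"]
macro, micro = f1_scores(gold, pred)
```

In this toy case the missed rare topic pulls the macro score well below the micro score, which is why the two figures are usually reported together when topics are imbalanced.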

Abstract

This paper introduces a computational framework designed to delineate gender distribution biases in topics covered by French TV and radio news. We transcribe a dataset of 11.7k hours, broadcast in 2023 on 21 French channels. A Large Language Model (LLM) is used in few-shot conversation mode to obtain a topic classification on those transcriptions. Using the generated LLM annotations, we explore the fine-tuning of a specialized smaller classification model to reduce the computational cost. To evaluate the performance of these models, we construct and annotate a dataset of 804 dialogues. This dataset is made available free of charge for research purposes. We show that women are notably underrepresented in subjects such as sports, politics and conflicts. Conversely, on topics such as weather, commercials and health, women have more speaking time than their overall average across all subjects. We also observe representation differences between private and public service channels.
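
The comparison described above, women's speaking time on a topic against their overall average across all subjects, can be sketched as follows. This is only an illustration under stated assumptions: the input format (tuples of topic, binary gender label, and seconds of speech), the function name `women_speaking_share_by_topic`, and the toy durations are hypothetical, and the exact bias measure plotted in the paper's Figure 1 may be defined differently.

```python
from collections import defaultdict


def women_speaking_share_by_topic(segments):
    """Compare women's speaking-time share per topic with their overall share.

    segments: iterable of (topic, gender, seconds), gender in {"F", "M"}.
    Returns (overall_share, per_topic_share, deviation_from_overall).
    The input layout is an assumption made for this sketch.
    """
    total = defaultdict(float)
    women = defaultdict(float)
    for topic, gender, seconds in segments:
        total[topic] += seconds
        if gender == "F":
            women[topic] += seconds

    # Women's share of all speaking time, across every topic.
    overall = sum(women.values()) / sum(total.values())
    # Women's share of speaking time within each topic.
    per_topic = {t: women[t] / total[t] for t in total}
    # Positive: women speak more on this topic than their overall average.
    deviation = {t: per_topic[t] - overall for t in total}
    return overall, per_topic, deviation


# Toy durations in seconds (invented for illustration, not paper data).
segments = [
    ("sports", "M", 90.0), ("sports", "F", 10.0),
    ("weather", "M", 20.0), ("weather", "F", 30.0),
    ("politics", "M", 70.0), ("politics", "F", 30.0),
]
overall, per_topic, deviation = women_speaking_share_by_topic(segments)
```

With these invented numbers the deviation is negative for sports and positive for weather, which mirrors the direction of the findings reported in the abstract without reproducing any of its values.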

Paper Structure (15 sections, 1 figure, 2 tables)

Introduction
Related works
Data description
Preprocessing
Annotation guidelines
Annotation campaign
Methodology
Baseline BERT classification models
Mixtral-8x7B few-shot classification
Teacher/Student models
Results
Automatic system evaluation
Gender representation biases in broadcast news
Conclusion
Acknowledgements

Figures (1)

Figure 1: Measured gender representation bias per topic.
