Automatic Classification of News Subjects in Broadcast News: Application to a Gender Bias Representation Analysis

Valentin Pelloin, Lena Dodson, Émile Chapuis, Nicolas Hervé, David Doukhan

TL;DR

This paper addresses the problem of quantifying gender-based topic representation biases in French broadcast news. The authors assemble a large-scale dataset of 11.7k hours of transcribed content from 21 channels, label topics with a few-shot Mixtral-8x7B LLM, and develop a scalable Teacher/Student distillation pipeline to train lighter classifiers. Evaluations on 804 manually annotated dialogues show competitive Macro-F1 and Micro-F1 scores (best Macro-F1 around 58.5% with Camembert-base, and Micro-F1 around 62.5%), while allowing over 2 million dialogues to be processed at reduced cost. The study then analyzes gender biases in topic representation, finding that women speak less on sports but more on weather and health, with notable differences between private and public channels. These results illustrate the framework's potential for large-scale monitoring of gendered content in media. The authors also release the annotated dataset and discuss limitations, including binary gender labeling, as well as future directions such as improved topic segmentation and broader-scope analyses.

Abstract

This paper introduces a computational framework designed to delineate gender distribution biases in topics covered by French TV and radio news. We transcribe a dataset of 11.7k hours, broadcast in 2023 on 21 French channels. A Large Language Model (LLM) is used in few-shot conversation mode to obtain a topic classification of those transcriptions. Using the LLM-generated annotations, we explore the fine-tuning of a smaller, specialized classification model to reduce the computational cost. To evaluate the performance of these models, we construct and annotate a dataset of 804 dialogues. This dataset is made available free of charge for research purposes. We show that women are notably underrepresented in subjects such as sports, politics and conflicts. Conversely, on topics such as weather, commercials and health, women have more speaking time than their overall average across all subjects. We also observe differences in representation between private and public service channels.
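
The comparison described above (women's speaking time per topic versus their overall average across all subjects) could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the function name `women_share_by_topic`, the `(topic, gender, seconds)` input layout, and the toy durations are all assumptions made for illustration.

```python
from collections import defaultdict


def women_share_by_topic(segments):
    """Return (overall_share, per_topic_share) of women's speaking time.

    `segments` is a hypothetical iterable of (topic, gender, seconds)
    tuples, with gender given as "F" or "M" (binary labels, as in the paper).
    """
    total = defaultdict(float)  # speaking time per topic, all speakers
    women = defaultdict(float)  # speaking time per topic, women only
    for topic, gender, seconds in segments:
        total[topic] += seconds
        if gender == "F":
            women[topic] += seconds
    # Overall share: women's time over all time, pooled across topics.
    overall = sum(women.values()) / sum(total.values())
    per_topic = {topic: women[topic] / total[topic] for topic in total}
    return overall, per_topic


# Toy durations, invented for illustration only (not results from the paper).
segments = [
    ("sports", "M", 90.0), ("sports", "F", 10.0),
    ("weather", "F", 60.0), ("weather", "M", 40.0),
]
overall, per_topic = women_share_by_topic(segments)

# A negative delta means women speak less on that topic than their average.
deltas = {topic: share - overall for topic, share in per_topic.items()}
```

A topic with a negative delta corresponds to what the abstract calls underrepresentation relative to the overall average; a positive delta corresponds to topics where women have more speaking time than that average.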

Paper Structure (15 sections, 1 figure, 2 tables)