Television Discourse Decoded: Comprehensive Multimodal Analytics at Scale

Anmol Agarwal; Pratyush Priyadarshi; Shiven Sinha; Shrey Gupta; Hitkul Jangra; Ponnurangam Kumaraguru; Kiran Garimella

TL;DR

The paper tackles the challenge of analyzing televised debates at scale by introducing a comprehensive multimodal analytics toolkit that fuses computer vision, speech-to-text, and NLP to transcribe, diarize, and analyze thousands of YouTube debates from a major Indian prime-time show. It builds a large-scale dataset (2,087 hours across 3,000 videos) and deploys a hybrid annotation pipeline (including LLM-assisted labeling) to quantify bias, gender representation, and incivility through metrics such as topic bias toward the ruling party, underrepresentation of women, overlapping speech, toxicity, and shouting. Key findings reveal a pro-ruling-party bias, persistent gender imbalance, and elevated incivility, with shouting averaging about 9% of debate duration and toxicity concentrated on sensitive topics; the work also demonstrates generalizability to other English-language debates. The study contributes a scalable methodology and openly shares code and data to catalyze further research in multimedia discourse analysis, with implications for media ethics, democratic deliberation, and policy.

Abstract

In this paper, we tackle the complex task of analyzing televised debates, with a focus on a prime-time news debate show from India. Previous methods, which often relied solely on text, fall short in capturing the multimodal essence of these debates. To address this gap, we introduce a comprehensive automated toolkit that employs advanced computer vision and speech-to-text techniques for large-scale multimedia analysis. Using this toolkit, we transcribe, diarize, and analyze thousands of YouTube videos of the show. These debates are a central part of Indian media but have been criticized for compromised journalistic integrity and excessive dramatization. Our toolkit provides concrete metrics to assess bias and incivility, capturing a comprehensive multimedia perspective that includes text, audio utterances, and video frames. Our findings reveal significant biases in topic selection and panelist representation, along with alarming levels of incivility. This work offers a scalable, automated approach for future research in multimedia analysis, with profound implications for the quality of public discourse and democratic debate. To catalyze further research in this area, we also release the code, the collected dataset, and a supplemental PDF.
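As an illustration of how one of the incivility metrics mentioned above could be derived from diarization output, the following is a minimal sketch that computes the fraction of a video's duration with overlapping speech. The segment format `(start_sec, end_sec, speaker_label)` and the function name `overlap_fraction` are assumptions made for this example; the paper's released code may represent and compute this differently.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical diarization output: (start_sec, end_sec, speaker_label).
Segment = Tuple[float, float, str]


def overlap_fraction(segments: List[Segment], total_duration: float) -> float:
    """Fraction of total_duration during which two or more distinct speakers talk."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")

    # Turn each segment into a start event (+1) and an end event (-1).
    events = []
    for start, end, speaker in segments:
        if end > start:  # ignore empty or inverted segments
            events.append((start, 1, speaker))
            events.append((end, -1, speaker))

    # Sort by time; at equal times, ends come before starts so that
    # segments that merely touch are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    active = Counter()  # speaker -> number of currently open segments
    overlapped = 0.0
    prev_time = 0.0
    for time, delta, speaker in events:
        # Count distinct speakers active in the interval [prev_time, time).
        n_speakers = sum(1 for count in active.values() if count > 0)
        if n_speakers >= 2:
            overlapped += time - prev_time
        active[speaker] += delta
        prev_time = time

    return overlapped / total_duration


# Example: speakers A and B overlap between 5 s and 10 s of a 40 s clip.
example = [(0.0, 10.0, "A"), (5.0, 15.0, "B"), (20.0, 30.0, "A")]
fraction = overlap_fraction(example, total_duration=40.0)
```

Counting distinct speakers, rather than open segments, keeps two back-to-back segments from the same speaker from being mistaken for crosstalk.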

Paper Structure (28 sections, 8 figures, 5 tables)

This paper contains 28 sections, 8 figures, 5 tables.

Introduction
Background and Related Work
Bias and Incivility in Indian media
Analysis of TV News and Media
Multimodal Analysis Tools
Data Collection & Processing
Categorizing the Videos
Transcription and Speaker Diarization
Face and Gender Detection
Extracting Panelist Names from Transcripts
What is discussed in the debates?
Bias in Transcripts
Gender Bias
Incivility in the Debates
Overlapping Speech and Toxicity
...and 13 more sections

Figures (8)

Figure 1: Pipeline overview: Branch (a) details the process for identifying gender from facial data in videos and extracting hashtags from debate screens; Branch (b) outlines the audio cleaning and speaker diarization procedures, followed by transcription of utterances into text; Branch (c) illustrates the semi-automated annotation system that leverages YouTube metadata & LLMs to streamline the categorization of videos into categories, thereby reducing human annotation workload.
Figure 2: Fraction of panelists invited from the ruling party vs. the opposition. Pro-ruling-party panelists appear more than the opposition in almost all categories.
Figure 3: Average number of faces observed when a frame is randomly sampled from a video in the given month. Female guests are consistently underrepresented compared to their male counterparts.
Figure 4: Confidence Intervals. (a) Top-5 categories with more females than average. (b) Bottom-5 categories with fewer females than average. (c) Fraction of the total duration of videos exhibiting overlapped speech for the top-5 categories, significantly exceeding the dataset's mean. The highest-ranking category has overlapping speech for 20% of video duration. (d) Fraction of the total duration of videos with overlapping speech for the bottom-5 categories, significantly below the dataset's mean. (e) Fraction of the total duration of videos with toxic speech in the top-5 most toxic categories. (f) Fraction of the total duration of videos with the most shouting in the top-5 categories.
Figure 5: Comparison with other TV debate channels: (a) Fraction of video duration with overlapping speech. (b) Fraction of video duration with toxic speech.
...and 3 more figures
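
The Figure 4 caption compares per-category confidence intervals against the dataset's mean. The sketch below shows one way such a comparison could be made, assuming a normal-approximation 95% interval over per-video fractions. The interval method, the function names `mean_ci` and `excludes`, and the sample values are all assumptions for illustration; the caption does not state how the paper's intervals were computed.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci(values: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation interval.

    Assumption: z = 1.96 gives an approximate 95% interval.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - z * std_err, mean + z * std_err


def excludes(ci: Tuple[float, float, float], reference: float) -> bool:
    """True if the interval lies entirely above or below the reference value."""
    _mean, lower, upper = ci
    return reference < lower or reference > upper


# Made-up per-video overlap fractions for one hypothetical category.
category_values = [0.18, 0.22, 0.20, 0.21, 0.19]
category_ci = mean_ci(category_values)

# Made-up dataset-wide mean, used only to demonstrate the comparison.
differs = excludes(category_ci, reference=0.10)
```

A category would be reported as significantly above or below the dataset mean only when `excludes` returns true for its interval.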
