Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models

Minh Nguyen; Franck Dernoncourt; Seunghyun Yoon; Hanieh Deilamsalehy; Hao Tan; Ryan Rossi; Quan Hung Tran; Trung Bui; Thien Huu Nguyen

TL;DR

This work tackles text-based SpeakerID. It addresses the lack of large-scale datasets by constructing a MediaSum-derived corpus, and it proposes transformer-based models that exploit local dialogue context. It introduces two model families, a Single-Name model and a Multi-Name model (the latter optionally augmented with a Graph Convolutional Network), both built around RoBERTa representations and a context-aware inference scheme. On a carefully constructed test set, the Single-Name model achieves the highest precision, 80.3%, with a recall of 50.0% and an F1 of 61.6%, while the Multi-Name and GCN variants score slightly lower. The results are bounded by the fact that many speakers' names are never mentioned in the transcripts. The data and code are publicly available, which supports practical work on the accessibility and searchability of dialogue content and lays a foundation for future improvements in text-based SpeakerID.
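As a quick consistency check on the reported numbers, the sketch below recomputes F1 from the stated precision and recall. It assumes F1 is the standard harmonic mean of precision and recall; the summary above does not spell out the metric definition, and the function name `f1_score` is hypothetical, chosen only for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard F1 definition)."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the Single-Name model in the summary above.
reported_precision = 0.803
reported_recall = 0.500

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1 * 100:.1f}%")
```

Under that assumption, the result rounds to the reported 61.6%, so the three figures are mutually consistent.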

Abstract

We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives. Despite advances in speech recognition, text-based speaker identification (SpeakerID) has received limited attention and lacks large-scale, diverse datasets for effective model training. To address these gaps, we present a large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources. We propose transformer-based models tailored for SpeakerID that leverage contextual cues within dialogues to attribute speaker names accurately. In extensive experiments, our best model achieves a precision of 80.3%, setting a new benchmark for SpeakerID. The data and code are publicly available at https://github.com/adobe-research/speaker-identification

Paper Structure (15 sections, 2 equations, 3 figures, 3 tables)

Introduction
Methodology
Problem Definition
Data Collection
Proposed Models
Single-Name Model
Multi-Name Model
Inference
Experiments
Dataset
Hyper-parameters
Evaluation Metrics
Results
Related Work
Conclusions

Figures (3)

Figure 1: An example from the MediaSum dataset. In the SpeakerID setting, speakers' names are not available at test time; only the speaker labels produced by a speaker diarization system, such as "speaker1" and "speaker2", are given. A SpeakerID model must recover the speakers' actual names from the transcript.
Figure 2: Overview of our proposed single-name model for SpeakerID.
Figure 3: Overview of our proposed multi-name model for SpeakerID.
