Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models
Minh Nguyen, Franck Dernoncourt, Seunghyun Yoon, Hanieh Deilamsalehy, Hao Tan, Ryan Rossi, Quan Hung Tran, Trung Bui, Thien Huu Nguyen
TL;DR
This work tackles text-based SpeakerID, addressing the lack of large-scale datasets by constructing a MediaSum-derived corpus and proposing transformer-based models that exploit local dialogue context. It introduces two model families, a Single-Name model and a Multi-Name model (the latter optionally augmented with a Graph Convolutional Network), built around RoBERTa representations and a context-aware inference scheme. On a carefully constructed test set, the Single-Name model achieves the highest precision, $80.3\%$, with recall of $50.0\%$ and F1 of $61.6\%$, while the Multi-Name and GCN variants score slightly lower. These results are bounded by the fact that many speakers’ names are never mentioned in the transcripts. The data and code are publicly available, which supports practical work on the accessibility and searchability of dialogue content and lays a foundation for future improvements in text-based SpeakerID.
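As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported Single-Name precision and recall; the helper name `f1_score` is an illustrative choice and is not taken from the released code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Single-Name model figures reported above, expressed as fractions.
precision, recall = 0.803, 0.500
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")  # prints "F1 = 61.6%"
```

The recomputed value agrees with the reported F1 of $61.6\%$, and it shows how the comparatively low recall pulls the F1 well below the precision.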
Abstract
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives. Despite advances in speech recognition, text-based speaker identification (SpeakerID) has received limited attention and lacks large-scale, diverse datasets for effective model training. To address these gaps, we present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources. We propose novel transformer-based models tailored for SpeakerID that leverage contextual cues within dialogues to accurately attribute speaker names. In extensive experiments, our best model achieves a precision of 80.3\%, setting a new benchmark for SpeakerID. The data and code are publicly available at \url{https://github.com/adobe-research/speaker-identification}.
