Speaker Verification in Agent-Generated Conversations
Yizhe Yang, Palakorn Achananuparp, Heyan Huang, Jing Jiang, Ee-Peng Lim
TL;DR
This work defines speaker verification in agent-based conversations to evaluate how well role-playing AI preserves a target speaker's identity. It builds a large, multi-source dataset and develops several verification models spanning style-based, authorship-based, and fine-tuned approaches using hierarchical utterance encoding with a contrastive objective. It then introduces Simulation Score and Distinction Score to assess how faithfully agents simulate individual speakers and how distinctly they render different roles. Findings show that non-expert humans and ChatGPT struggle with the verification task, while fine-tuned, mixed-feature models perform best, though topic and linguistic accommodation affect accuracy. The study provides a rigorous framework for evaluating personalization in conversational AI and highlights notable gaps and opportunities to improve the realism and controllability of role-playing agents.
Abstract
The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents that are personalized to the characteristics and styles of different speakers, with the goal of enhancing their ability to perform both general and special-purpose dialogue tasks. However, the ability to personalize generated utterances to speakers, whether by humans or LLMs, has not been well studied. To bridge this gap, our study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker. To this end, we assemble a large dataset collection encompassing thousands of speakers and their utterances. We also develop and evaluate speaker verification models under different experimental setups. We then use these speaker verification models to evaluate the personalization abilities of LLM-based role-playing models. Comprehensive experiments suggest that current role-playing models fail to mimic speakers accurately, primarily because of the models' inherent linguistic characteristics.
