Speaker Verification in Agent-Generated Conversations

Yizhe Yang, Palakorn Achananuparp, Heyan Huang, Jing Jiang, Ee-Peng Lim

TL;DR

This work defines speaker verification in agent-generated conversations as a way to evaluate how well role-playing AI preserves a target speaker's identity. It builds a large, multi-source dataset and develops several verification models spanning style-based, authorship-based, and fine-tuned approaches, the last using hierarchical utterance encoding with a contrastive objective. It then introduces a Simulation Score and a Distinction Score to assess how faithfully agents simulate individual speakers and how distinctly they render different roles. Findings show that non-expert humans and ChatGPT struggle with verification, while fine-tuned, mixed-feature models perform best, though topic and linguistic accommodation affect accuracy. The study provides a framework for evaluating personalization in conversational AI and highlights gaps and opportunities for improving the realism and controllability of role-playing agents.

Abstract

The recent success of large language models (LLMs) has attracted widespread interest in developing role-playing conversational agents personalized to the characteristics and styles of different speakers, with the goal of enhancing their ability to perform both general and special-purpose dialogue tasks. However, the ability to personalize generated utterances to speakers, whether by humans or by LLMs, has not been well studied. To bridge this gap, our study introduces a novel evaluation challenge: speaker verification in agent-generated conversations, which aims to verify whether two sets of utterances originate from the same speaker. To this end, we assemble a large dataset collection encompassing thousands of speakers and their utterances. We also develop and evaluate speaker verification models under different experimental setups. We further use the speaker verification models to evaluate the personalization abilities of LLM-based role-playing models. Comprehensive experiments suggest that current role-playing models fail to accurately mimic speakers, primarily because of the models' own inherent linguistic characteristics.
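The task framing in the abstract (deciding whether two sets of utterances come from the same speaker) can be illustrated with a minimal sketch. This is not one of the paper's verification models: the character n-gram features stand in for a learned utterance encoder, and the function names, the pooling by summation, and the `threshold` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def utterance_vector(utterance, n=3):
    """Character n-gram counts for one utterance (stand-in for a learned encoder)."""
    text = utterance.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def speaker_vector(utterances, n=3):
    """Pool the per-utterance vectors into one vector for the whole utterance set."""
    total = Counter()
    for utterance in utterances:
        total.update(utterance_vector(utterance, n))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_speaker(set_a, set_b, threshold=0.5):
    """Return (similarity score, decision) for two utterance sets.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out same-speaker and different-speaker pairs.
    """
    score = cosine(speaker_vector(set_a), speaker_vector(set_b))
    return score, score >= threshold
```

A call such as `same_speaker(real_utterances, generated_utterances)` would then give a similarity score for a real-generated pair, which is the kind of quantity whose distribution Figures 1 and 2 describe.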

Paper Structure

This paper contains 31 sections, 3 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: The similarity score distribution of positive and negative real-generated pairs. The overlap between the two distributions suggests that the generated utterances do not align closely with their corresponding real-world roles.
  • Figure 2: The similarity score distribution of positive and negative generated-generated pairs. The overlap between the two distributions suggests that the generated utterances maintain consistency across different role settings.
  • Figure 3: Human questionnaire for speaker verification
  • Figure 4: Zero-Shot Prompt
  • Figure 5: COT Prompt
  • ...and 1 more figure