PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Edward Choi

Abstract

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/

Figures (9)

  • Figure 1: Framework Overview. PICon operates in two phases. The Interrogation Phase consists of three stages: (1) Get-to-Know, where baseline demographic questions are posed; (2) Main Interrogation, where the Questioner asks chained follow-up questions, the Entity & Claim Extractor identifies verifiable entities and claims, and the Questioner retrieves evidence via web search to generate confirmation questions; and (3) Retest, where earlier questions are re-asked. In the Evaluation Phase, the Evaluator assesses the full interrogation log across Internal Consistency, External Consistency, and Retest Consistency. (A structural sketch of this loop appears after this figure list.)
  • Figure 2: Consistency scores of the human group and the seven target persona groups (63 humans; 10 personas per group). Left: Radar charts showing mean internal (IC), external (EC), and retest consistency (RC) for each persona; dashed lines denote standard deviations. Right: Normalized areas of the triangles enclosed by the bold lines, used as aggregate scores, with error bars representing standard deviation. (A sketch of this area-based aggregation appears after the figure list.)
  • Figure 3: Trends in evaluation scores across metrics as the number of dialogue turns increases. Values represent mean scores across the seven persona groups.
  • Figure 4: Comparison of evaluation scores from proprietary API models with those from open-source models (dashed red lines).
  • Figure 5: Screenshots of the interview interface. (a) Home screen where participants begin the session. (b) Informed consent form presented prior to the interview. (c) Overview of the full website layout.
  • ...and 4 more figures
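
The pipeline summarized in Figure 1 can be read as a simple control loop: baseline questions, chained follow-ups interleaved with web-grounded confirmation questions, a retest pass, and a final scoring step. The sketch below is a minimal structural rendering of that loop, assuming hypothetical component interfaces (`persona_agent`, `questioner`, `extractor`, and `evaluator`, along with their methods, are illustrative names rather than the paper's actual API); the real implementation is available in the linked source code.

```python
from dataclasses import dataclass, field


@dataclass
class InterrogationLog:
    """Ordered record of every question/answer pair in one session."""
    turns: list = field(default_factory=list)

    def record(self, question: str, answer: str) -> None:
        self.turns.append({"question": question, "answer": answer})


def interrogate(persona_agent, questioner, extractor, evaluator, n_main_turns: int = 10):
    log = InterrogationLog()

    # Stage 1: Get-to-Know -- baseline demographic questions.
    for question in questioner.baseline_questions():
        log.record(question, persona_agent.answer(question))

    # Stage 2: Main Interrogation -- logically chained follow-ups plus
    # confirmation questions grounded in retrieved web evidence.
    for _ in range(n_main_turns):
        question = questioner.follow_up(log)
        answer = persona_agent.answer(question)
        log.record(question, answer)

        for claim in extractor.extract(answer):           # verifiable entities/claims
            evidence = questioner.web_search(claim)        # retrieve external evidence
            confirm_q = questioner.confirmation_question(claim, evidence)
            log.record(confirm_q, persona_agent.answer(confirm_q))

    # Stage 3: Retest -- re-ask earlier questions to probe response stability.
    for earlier_question in questioner.select_retest_questions(log):
        log.record(earlier_question, persona_agent.answer(earlier_question))

    # Evaluation Phase: the Evaluator scores the full log on the three dimensions.
    return evaluator.score(log)  # e.g. {"IC": ..., "EC": ..., "RC": ...}
```

Figure 2's aggregate score is described as the normalized area of the triangle the three per-dimension scores enclose on the radar chart. The snippet below is one natural way to compute such a value, assuming each score lies in [0, 1] and the three radar axes are spaced 120° apart; the paper's exact normalization may differ, and the function name is illustrative.

```python
import math


def normalized_triangle_area(ic: float, ec: float, rc: float) -> float:
    """Aggregate three consistency scores (each assumed in [0, 1]) via the area
    of the triangle they enclose on a 3-axis radar chart.

    With axes 120 degrees apart, each adjacent pair of scores (a, b) spans a
    sub-triangle of area 0.5 * a * b * sin(120 deg). Dividing by the area when
    all scores equal 1 yields a value in [0, 1].
    """
    sin120 = math.sin(2 * math.pi / 3)
    raw_area = 0.5 * sin120 * (ic * ec + ec * rc + rc * ic)
    max_area = 0.5 * sin120 * 3  # ic = ec = rc = 1
    return raw_area / max_area   # simplifies to (ic*ec + ec*rc + rc*ic) / 3


# Example: strong internal consistency but weak retest consistency.
print(round(normalized_triangle_area(0.9, 0.8, 0.5), 3))  # 0.523
```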