InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation

Yu Li, Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu

TL;DR

We present an interview-grounded evaluation framework for large-scale personality simulation and reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention.

Abstract

Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview transcripts across 1,000 public personalities, each with an average of 11.5 hours of interview content. We propose a multi-dimensional evaluation framework with four complementary metrics measuring content similarity, factual consistency, personality alignment, and factual knowledge retention. Through systematic comparison, we demonstrate that methods grounded in real interview data substantially outperform those relying solely on biographical profiles or the model's parametric knowledge. We further reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention. Our evaluation framework enables principled method selection based on application requirements, and our empirical findings provide actionable insights for advancing personality simulation research.
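
As a rough illustration of the two ways of using interview data contrasted above, the sketch below selects context for a new question either by similarity to that question or by recency. It is a toy under stated assumptions, not the paper's implementation: the `QAPair` type, the word-overlap (Jaccard) scoring, the reading of "chronological" as "the most recent pairs in time order", and the parameter `k` are all hypothetical choices made for the example.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class QAPair:
    # Hypothetical record type; the field names are assumptions.
    asked_on: date
    question: str
    answer: str


def _tokens(text: str) -> set[str]:
    # Lowercased word tokens, punctuation dropped.
    return set(re.findall(r"\w+", text.lower()))


def retrieve_context(history: list[QAPair], question: str, k: int = 2) -> list[QAPair]:
    """Toy retrieval: rank past pairs by Jaccard overlap with the new question."""
    q = _tokens(question)

    def score(pair: QAPair) -> float:
        t = _tokens(pair.question)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(history, key=score, reverse=True)[:k]


def chronological_context(history: list[QAPair], k: int = 2) -> list[QAPair]:
    """Toy chronological selection: the k most recent pairs, oldest first."""
    return sorted(history, key=lambda pair: pair.asked_on)[-k:]


history = [
    QAPair(date(2019, 1, 1), "What drives your research?", "Curiosity."),
    QAPair(date(2020, 1, 1), "Where did you grow up?", "In Ohio."),
    QAPair(date(2021, 1, 1), "What is your favorite film?", "Alien."),
]
by_similarity = retrieve_context(history, "Where did you grow up as a child?", k=1)
by_recency = chronological_context(history, k=2)
```

A real system would likely replace the word-overlap score with an embedding-based retriever; the point here is only that the first function picks context by relevance to the question while the second picks it by time.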

Paper Structure (70 sections, 13 equations, 14 figures, 15 tables)

Figures (14)

  • Figure 1: Overview of the InterviewSim framework. Left: The interview data collection pipeline selects 1,000 personalities, curates and verifies interview transcripts through automated filtering and human review, structures them into Q&A pairs across four thematic categories, and splits them temporally into training and test sets. Center: Any generation method can be applied using the training data to produce responses to held-out test questions. Right: The evaluation protocol assesses simulation fidelity along four complementary dimensions: content similarity, factual consistency, personality similarity, and factual knowledge retention via MCQ.
  • Figure 2: Distribution of 1,000 subjects across eight professional categories.
  • Figure 3: Contradiction ratio by question category across five methods. Social Identity questions are hardest for all methods; Motivations and Values are easiest.
  • Figure 4: Contradiction ratio by personality category. Entertainment personalities are hardest to simulate; Science & Academia are easiest. The pattern is consistent across methods.
  • Figure 5: Content Similarity evaluation prompt template for assessing semantic similarity between generated and ground truth responses.
  • ...and 9 more figures
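
Two steps named in the captions above, the temporal train/test split (Figure 1) and the per-category contradiction ratio (Figures 3 and 4), can be sketched as follows. This is a minimal sketch under assumptions: the 20% test fraction, the dictionary record layout, and the definition of the contradiction ratio as the share of responses judged to contradict the ground truth are illustrative choices, not values or definitions taken from the paper.

```python
from collections import defaultdict
from datetime import date


def temporal_split(pairs, test_fraction=0.2):
    """Sort Q&A pairs by date; the latest fraction becomes the test set.

    test_fraction is a hypothetical parameter, not a value from the paper.
    """
    ordered = sorted(pairs, key=lambda p: p["date"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def contradiction_ratio_by_category(judgments):
    """judgments: iterable of (category, contradicted) with contradicted a bool.

    Returns, per category, the fraction of responses flagged as contradicting
    the ground truth (an assumed definition of the ratio).
    """
    totals = defaultdict(int)
    contradicted = defaultdict(int)
    for category, flag in judgments:
        totals[category] += 1
        contradicted[category] += int(flag)
    return {category: contradicted[category] / totals[category] for category in totals}


pairs = [
    {"date": date(2018, 5, 1), "question": "q1"},
    {"date": date(2021, 3, 1), "question": "q4"},
    {"date": date(2019, 7, 1), "question": "q2"},
    {"date": date(2022, 9, 1), "question": "q5"},
    {"date": date(2020, 2, 1), "question": "q3"},
]
train, test = temporal_split(pairs)

ratios = contradiction_ratio_by_category([
    ("Social Identity", True),
    ("Social Identity", False),
    ("Motivations and Values", False),
])
```

Splitting by date rather than at random keeps every test question later than every training example, so a method cannot see answers from interviews that postdate the ones it is tested on.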