Real or Robotic? Assessing Whether LLMs Accurately Simulate Qualities of Human Responses in Dialogue

Jonathan Ivey, Shivani Kumar, Jiayu Liu, Hua Shen, Sushrita Rakshit, Rohan Raju, Haotian Zhang, Aparna Ananthasubramaniam, Junghwan Kim, Bowen Yi, Dustin Wright, Abraham Israeli, Anders Giovanni Møller, Lechen Zhang, David Jurgens

TL;DR

This work evaluates the fidelity of LLM-based simulations of human dialogue by extending WildChat conversations with a Simulator-generated Turn 3 and comparing it against the human's actual Turn 3 across 21 linguistic metrics in English, Chinese, and Russian. It finds substantial divergence between simulated and human replies, with higher alignment in semantic and stylistic features but weaker syntactic and conversational-dynamics fidelity; the choice of prompt has a larger impact than the choice of Simulator model. The paper provides a scalable evaluation framework, a large multilingual dataset, and regression analyses showing that human Turn 1 style and topic strongly modulate simulation quality, suggesting the need for prompt engineering and potential human-in-the-loop approaches for higher-fidelity datasets. Overall, while LLM simulations can reduce data-collection costs, current open-weight models struggle to faithfully replicate human responses, particularly in structured or domain-specific contexts. These limitations point to future improvements in prompting, tuning, and evaluation protocols.

Abstract

Studying and building datasets for dialogue tasks is both expensive and time-consuming due to the need to recruit, train, and collect data from study participants. In response, much recent work has sought to use large language models (LLMs) to simulate both human-human and human-LLM interactions, as they have been shown to generate convincingly human-like text in many settings. However, to what extent do LLM-based simulations actually reflect human dialogues? In this work, we answer this question by generating a large-scale dataset of 100,000 paired LLM-LLM and human-LLM dialogues from the WildChat dataset and quantifying how well the LLM simulations align with their human counterparts. Overall, we find relatively low alignment between simulations and human interactions, demonstrating a systematic divergence along multiple textual properties, including style and content. Further, in comparisons of English, Chinese, and Russian dialogues, we find that models perform similarly. Our results suggest that LLMs generally perform better when the human writes in a way that is more similar to the LLM's own style.

Paper Structure (34 sections, 12 figures, 10 tables)

Figures (12)

  • Figure 1: A sample conversation between a human and GPT-3.5 on WildChat and Llama3.1-8B's simulation of Turn 3 of the conversation. In this study, we compare the Simulator's output against the Human's output using 21 metrics, covering lexical, syntactic, semantic, and stylistic features.
  • Figure 2: How well do LLMs simulate Human responses to a Chatbot? We compare the nine models used as Simulators to the original Human by correlating properties of the text they write (Table \ref{tab:eval-metrics}). Bars represent the average correlation across all metrics in a category, and error bars are bootstrapped 95% confidence intervals over these metrics. As a baseline, we also compare the performance of a human annotator on this task. There is limited cross-model variation in performance, and Simulators tend to have higher performance in semantic features and lower performance in syntactic features, while the opposite is true of the human annotators.
  • Figure 3: How well do LLMs predict whether Humans end a conversation with the Chatbot after the first turn? Each bar represents the binary F1 score of a model predicting whether a conversation will end. The gray horizontal lines show the performance of human annotators and of a random baseline that ends the conversation 50% of the time. While there is inter-model variation, all models perform worse than chance. The human annotator performs better than chance.
  • Figure 4: Using the methods from Figure \ref{fig:model-differences}, the performance of the best and worst prompts and of the annotators is compared across metric categories. The best (a direct prompt) and worst (an override prompt) prompts are selected based on an overall average across all metrics and shown in Table \ref{tab:top-prompts}. The worst prompt underperforms the best prompt in all categories, and annotators outperform all prompts in syntax metrics.
  • Figure 5: How well do Simulators replicate Human text across languages? Similar to Figures \ref{fig:model-differences} and \ref{fig:end-differences}, we plot the similarity between Simulator and Human text across ten metrics in three languages. English, Chinese, and Russian have similar performance patterns across all five categories of metrics. However, some differences exist (e.g., Simulators in Chinese outperform those in other languages in lexical and semantic metrics but underperform in conversation endings). Correlations of individual metrics are shown in Tables \ref{tab:ind_corr_cn} and \ref{tab:ind_corr_ru}.
  • ...and 7 more figures
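
The captions for Figures 2 and 3 describe two computations: an average per-metric correlation with a bootstrapped 95% confidence interval over metrics, and a binary F1 score compared against a random baseline that ends the conversation 50% of the time. The following is a minimal sketch of how such quantities could be computed, not the authors' implementation. It assumes Pearson correlation and a percentile bootstrap (the captions do not say which variants were used), and the metric names and toy values are made up purely for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`.

    Here `values` holds one correlation per metric, so the resampling
    is over metrics, as the Figure 2 caption describes.
    """
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def binary_f1(y_true, y_pred):
    """F1 score for the positive class (1 = conversation ends)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-conversation metric values (not data from the paper).
human = {
    "word_count": [12, 40, 7, 25, 18],
    "sentiment": [0.1, -0.3, 0.5, 0.0, 0.2],
    "formality": [0.8, 0.4, 0.6, 0.7, 0.3],
}
simulator = {
    "word_count": [30, 35, 28, 33, 31],
    "sentiment": [0.2, -0.1, 0.4, 0.1, 0.3],
    "formality": [0.9, 0.8, 0.7, 0.9, 0.8],
}

# One correlation per metric, then the average and its bootstrap interval.
per_metric = [pearson(human[m], simulator[m]) for m in human]
print("mean correlation:", mean(per_metric))
print("95% CI over metrics:", bootstrap_ci(per_metric))

# Conversation-ending prediction: 1 = the Human ends after the first turn.
human_ended = [1, 0, 0, 1, 0, 1]
simulator_ended = [0, 0, 1, 1, 0, 0]
rng = random.Random(0)
random_baseline = [int(rng.random() < 0.5) for _ in human_ended]
print("simulator F1:", binary_f1(human_ended, simulator_ended))
print("random-baseline F1:", binary_f1(human_ended, random_baseline))
```

With only a handful of metrics per category, a bootstrap over metrics yields wide intervals, so intervals computed this way are best read as a rough indication of spread across metrics rather than as a precise estimate.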