Characterizing Similarities and Divergences in Conversational Tones in Humans and LLMs by Sampling with People

Dun-Ming Huang; Pol Van Rijn; Ilia Sucholutsky; Raja Marjieh; Nori Jacoby

TL;DR

This work tackles the challenge of comparing conversational tones between humans and LLMs by introducing Sampling with People (SP), an iterative elicitation framework grounded in cognitive science that jointly samples tones and sentences through a Gibbs-sampler-like process. Using SP, followed by quality-of-fit annotations and a shared geometric embedding obtained via Multidimensional Scaling (MDS), the authors build a cross-domain representation of tones and show how it can benchmark unsupervised semantic alignment methods, with BLI outperforming GWOT and Procrustes in recovering cross-domain structure. The study finds that human and GPT tonal representations both cluster by valence, yet differ on specific cues such as arousal and relational meaning, and that a shared tone space makes it possible to translate tones across domains. The dataset and methods offer a resource for evaluating and improving human–AI communication, with potential extensions to multilingual and cross-cultural contexts and applications in AI alignment.
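
As a rough illustration of the shared embedding step, the sketch below stacks human and GPT rating vectors for the same tones and sentences, correlates all rows with each other, and embeds the result in two dimensions. It is not the authors' code: the function names, the 1 - r dissimilarity, and the use of classical (eigendecomposition-based) MDS are assumptions made for this example.

```python
import numpy as np

def classical_mds(dissimilarity, n_components=2):
    """Classical (Torgerson) MDS; assumed here as a simple MDS variant."""
    d2 = np.asarray(dissimilarity, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering           # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # keep the largest ones
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def shared_tone_space(human_ratings, gpt_ratings, n_components=2):
    """Embed both domains together.

    Each input is a (tones x sentences) array of quality-of-fit ratings,
    with rows and columns in the same order for both domains.
    """
    stacked = np.vstack([human_ratings, gpt_ratings])
    corr = np.corrcoef(stacked)      # within- and cross-domain correlations
    dissimilarity = 1.0 - corr       # assumption: 1 - r as the dissimilarity
    coords = classical_mds(dissimilarity, n_components)
    n_tones = human_ratings.shape[0]
    return coords[:n_tones], coords[n_tones:]
```

With the two coordinate sets in hand, `np.linalg.norm(human_xy - gpt_xy, axis=1)` gives one distance per tone word, which is one way to read off how far apart the human and GPT versions of the same tone sit in the shared space.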

Abstract

Conversational tones -- the manners and attitudes in which speakers communicate -- are essential to effective communication. Amidst the increasing popularization of Large Language Models (LLMs) over recent years, it becomes necessary to characterize the divergences in their conversational tones relative to humans. However, existing investigations of conversational modalities rely on pre-existing taxonomies or text corpora, which suffer from experimenter bias and may not be representative of real-world distributions for the studies' psycholinguistic domains. Inspired by methods from cognitive science, we propose an iterative method for simultaneously eliciting conversational tones and sentences, where participants alternate between two tasks: (1) one participant identifies the tone of a given sentence and (2) a different participant generates a sentence based on that tone. We run 100 iterations of this process with human participants and GPT-4, then obtain a dataset of sentences and frequent conversational tones. In an additional experiment, humans and GPT-4 annotated all sentences with all tones. With data from 1,339 human participants, 33,370 human judgments, and 29,900 GPT-4 queries, we show how our approach can be used to create an interpretable geometric representation of relations between conversational tones in humans and GPT-4. This work demonstrates how combining ideas from machine learning and cognitive science can address challenges in human-computer interactions.
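
The alternating procedure described above can be pictured as a chain that switches between naming a tone and writing a sentence. The following is a minimal sketch only: the lookup tables and the functions `identify_tone` and `generate_sentence` are hypothetical stand-ins for the responses that human participants or GPT-4 would give in the actual study.

```python
import random

# Hypothetical toy data standing in for participant (or GPT-4) responses.
TONES_FOR = {
    "Thank you so much for your help!": ["grateful", "polite"],
    "Could you please pass the report?": ["polite", "formal"],
    "Per our agreement, payment is due Friday.": ["formal"],
}
SENTENCES_FOR = {
    "grateful": ["Thank you so much for your help!"],
    "polite": ["Thank you so much for your help!",
               "Could you please pass the report?"],
    "formal": ["Could you please pass the report?",
               "Per our agreement, payment is due Friday."],
}

def identify_tone(sentence, rng):
    """Task 1 (toy): one respondent names the tone of a given sentence."""
    return rng.choice(TONES_FOR[sentence])

def generate_sentence(tone, rng):
    """Task 2 (toy): a different respondent writes a sentence in that tone."""
    return rng.choice(SENTENCES_FOR[tone])

def run_chain(seed_sentence, n_iterations, rng):
    """Alternate the two tasks and record every (sentence, tone) pair."""
    sentence, samples = seed_sentence, []
    for _ in range(n_iterations):
        tone = identify_tone(sentence, rng)      # tone given sentence
        samples.append((sentence, tone))
        sentence = generate_sentence(tone, rng)  # sentence given tone
    return samples

chain = run_chain("Thank you so much for your help!", 100, random.Random(0))
```

Counting the tones in `chain` (for example with `collections.Counter`) yields a frequency table of the kind that would identify the most common conversational tones in a sample.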

Paper Structure (40 sections, 18 figures, 3 tables)

Introduction
Detailed Approach
Elicitation via Sampling with People
Annotation via Quality-of-fit Rating
Geometric Representation of Conversational Tones
Application: Benchmarking Semantic Alignment Methods
Method
Participants
General Procedure
Results
Elicitation (Sampling with People)
Annotation via Quality-of-fit Rating
Conversational Tone Representation (Multidimensional scaling)
Application: Ground Truth for Benchmarking Semantic Alignment Methods
Discussion
...and 25 more sections

Figures (18)

Figure 1: Summary of our approach. A: Problem statement. B: The Sampling with People paradigm that aims to collect a representative sample of conversational tones and sentences. C: A quality-of-fit rating procedure that allows us to obtain vector representations of conversational tones with respect to their usage context. D: A geometric representation of the shared embedding space across elicited domains (human, GPT). E: As an application of our obtained data, we benchmark a selection of popular unsupervised cross-domain alignment methods.
Figure 2: Results of the Sampling with People and Quality-of-fit Rating paradigms, and comparison to the similarity judgments paradigm. A: Selection of the most popular conversational tones from the human and GPT instances, with their frequencies in the respective samples (red for humans, blue for GPT). Error bars represent one standard deviation via bootstrapping. B: Correlation matrices of conversational tone quality-of-fit rating embeddings within humans (left) and within GPT (right). C: Cross-domain (cross-correlation) matrix of human and GPT rating embeddings for conversational tones, and a bar plot showing the correlation between human and GPT rating embeddings for each conversational tone word. Error bars represent one standard deviation via bootstrapping. D: Similarity matrices of conversational tones derived from similarity judgments by humans (left) and GPT (right). See the enlarged version of this figure in the Appendix (Figure \ref{fig:dense-large}).
Figure 3: Cross-correlation alignment information. Blue points and arrow marks in A and C represent GPT-originated data, while red represents human-originated data. A: The MDS solution applied to the combined within/across-domain (cross-correlation) matrix, treated as a set of high-dimensional embeddings, to represent the shared space of conversational tone embeddings across humans and GPT. Grey edges connect points representing the same conversational tone word. Arrow marks represent rating-derived dimensions of conversational tones. B: A bar plot showing the Euclidean distance between pairs of embeddings of the same conversational tone in MDS space. Error bars represent one standard deviation via bootstrapping. C: A graph showing the nearest-neighbor matches of conversational tone embeddings across humans and GPT. To measure the robustness of the matching, we bootstrapped the process 5000 times. Darker edges indicate matches that occurred more frequently across bootstrap runs. See the enlarged version of this figure in the Appendix (Figure \ref{fig:mds-large}).
Figure 4: Instructions shown to participants.
Figure 5: ChatGPT prompts for creating the seeds that generate the text shown in the interface detailed in Figure \ref{fig:operationalize-b}.
...and 13 more figures
