HLB: Benchmarking LLMs' Humanlikeness in Language Use

Xufeng Duan; Bei Xiao; Xuemei Tang; Zhenguang G. Cai

TL;DR

HLB presents a psycholinguistic benchmark to quantify humanlikeness in language use by evaluating 20 LLMs on 10 experiments spanning sound, word, syntax, semantics, and discourse, with responses from over 2,000 humans. Humanlikeness is computed from task-response distributions using $HS_{item} = 1 - JS(P, Q) = 1 - \frac{1}{2} \left[ KL(P \parallel M) + KL(Q \parallel M) \right]$, where P and Q are the human and model distributions and M is their average. The results reveal domain-specific gaps and show that Llama-family models tend to score higher on humanlikeness than OpenAI and Mistral models, while improvements on standard NLP metrics do not always translate into more humanlike language use. The benchmark provides a principled framework for assessing humanlike language use in LLMs and for guiding future development to better capture semantic and discourse processing in real-world language.
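The item-level score above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the two distributions are assumed to be already aligned over the same response categories, and the logarithm is assumed to be base 2 (the summary does not state the base), which keeps the Jensen-Shannon divergence, and therefore the score, between 0 and 1.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits. Terms with p[i] == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def humanlikeness_score(human, model):
    """Return 1 - JS(P, Q) for two aligned probability distributions.

    Assumption: log base 2, so the result lies in [0, 1].
    """
    if len(human) != len(model):
        raise ValueError("distributions must cover the same categories")
    # M is the average of the two distributions, so M[i] > 0 whenever
    # P[i] > 0 or Q[i] > 0 and the KL terms are always finite.
    m = [(pi + qi) / 2 for pi, qi in zip(human, model)]
    js = 0.5 * (kl_divergence(human, m) + kl_divergence(model, m))
    return 1.0 - js


if __name__ == "__main__":
    # Hypothetical distributions over two response categories.
    human_dist = [0.7, 0.3]
    model_dist = [0.9, 0.1]
    print(humanlikeness_score(human_dist, model_dist))
```

Under these assumptions, identical distributions give a score of 1 and distributions with no overlapping responses give a score of 0.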

Abstract

As synthetic data becomes increasingly prevalent in training language models, particularly through generated dialogue, concerns have emerged that these models may deviate from authentic human language patterns, potentially losing the richness and creativity inherent in human communication. This highlights the critical need to assess the humanlikeness of language models in real-world language use. In this paper, we present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs) using 10 psycholinguistic experiments designed to probe core linguistic aspects, including sound, word, syntax, semantics, and discourse (see https://huggingface.co/spaces/XufengDuan/HumanLikeness). To anchor these comparisons, we collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments. For rigorous evaluation, we developed a coding algorithm that accurately identified language use patterns, enabling the extraction of response distributions for each task. By comparing the response distributions between human participants and LLMs, we quantified humanlikeness through distributional similarity. Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels. Importantly, we found that improvements in other performance metrics did not necessarily lead to greater humanlikeness, and in some cases, even resulted in a decline. By introducing psycholinguistic methods to model evaluation, this benchmark offers the first framework for systematically assessing the humanlikeness of LLMs in language use.
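To illustrate what "extracting a response distribution" might look like once responses have been coded, the sketch below turns a list of coded labels into relative frequencies over a fixed category order. It is only an assumption-laden example: the function name, the labels, and the sample data are hypothetical, and the coding algorithm itself is not reproduced here.

```python
from collections import Counter


def response_distribution(coded_responses, categories):
    """Relative frequency of each category among the coded responses.

    Labels that are not listed in `categories` are ignored (an assumption
    made for this sketch).
    """
    counts = Counter(coded_responses)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no responses fall into the given categories")
    return [counts[c] / total for c in categories]


if __name__ == "__main__":
    # Hypothetical coded answers for one item of a name-gender judgment.
    categories = ["female", "male"]
    human_codes = ["female", "female", "male", "female"]
    model_codes = ["female", "male", "male", "male"]
    print(response_distribution(human_codes, categories))
    print(response_distribution(model_codes, categories))
```

Using the same category order for humans and models keeps the two distributions aligned, which is what a distributional comparison needs.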

Paper Structure (16 sections, 3 equations, 2 figures, 2 tables)

This paper contains 16 sections, 3 equations, 2 figures, 2 tables.

Introduction
Related Work
Psychological Experimentation on LLMs
Psycholinguistic Experimentation on LLMs
Methodology
Human Experiments
LLM Experiments
Response Coding
Humanlikeness Scoring
Result
Comparative Analysis
Case analysis
Discussion
Conclusion
Limitation
...and 1 more sections

Figures (2)

Figure 1: The benchmark framework. The example prompt is taken from the sound-gender association task, where humans can infer the gender of a novel name (e.g., Pelcra for Female; Pelcrad for Male) based on phonology.
Figure 2: Humanlikeness scores of three LLM families
