Word Synchronization Challenge: A Benchmark for Word Association Responses for Large Language Models

Tanguy Cazalets, Joni Dambre

TL;DR

The paper proposes the Word Synchronization Challenge as a dynamic benchmark to evaluate LLMs’ ability to capture human word associations and social cognition in HCI. It employs a dyadic word game and dataset generation across multiple LLM pairings, analyzing interaction histories with embedding-based distances and PCA visualizations to study synchronization and strategy. Findings show that higher-sophistication models achieve higher success rates and favor a balancing strategy, with successful interactions revealing multi-manifold semantic convergence. The benchmark offers a flexible framework to assess human-like alignment and theory-of-mind in AI-assisted communication, informing the design of empathetic, collaborative human-AI systems and guiding future research on cognitive mechanisms and biases in language models.

Abstract

This paper introduces the Word Synchronization Challenge, a novel benchmark to evaluate large language models (LLMs) in Human-Computer Interaction (HCI). This benchmark uses a dynamic game-like framework to test LLMs' ability to mimic human cognitive processes through word associations. By simulating complex human interactions, it assesses how LLMs interpret and align with human thought patterns during conversational exchanges, which are essential for effective social partnerships in HCI. Initial findings highlight the influence of model sophistication on performance, offering insights into the models' capabilities to engage in meaningful social interactions and adapt behaviors in human-like ways. This research advances the understanding of LLMs' potential to replicate or diverge from human cognitive functions, paving the way for more nuanced and empathetic human-machine collaborations.

Paper Structure

This paper contains 22 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An illustration of the game rules
  • Figure 2: Average Distance Between Embeddings for Model Combinations Over Last Rounds
  • Figure 3: Comparative analysis of a model's preferred strategies (here GPT-4o-mini playing against itself), showing preferences for word selection relative to the previous word and to the average of the last two words
  • Figure 4: Two different views of the projection of the embedding of one game between two instances of GPT-4o-mini
  • Figure 5: 3D plots with two different views of the projection of the embedding for a won game between GPT-4-turbo and GPT-4o-mini
  • ...and 2 more figures