Book your room in the Turing Hotel! A symmetric and distributed Turing Test with multiple AIs and humans

Christian Di Maio, Tommaso Guidi, Luigi Quarantiello, Jack Bell, Marco Gori, Stefano Melacci, Vincenzo Lomonaco

Abstract

In this paper, we report our experience with "TuringHotel", a novel extension of the Turing Test based on interactions within mixed communities of Large Language Models (LLMs) and human participants. The classical one-to-one interaction of the Turing Test is reinterpreted in a group setting, where human and artificial agents engage in time-bounded discussions and, interestingly, all act as both judges and respondents. This community is instantiated on the novel platform UNaIVERSE (https://unaiverse.io) as a "World" that defines the roles and interaction dynamics, facilitated by the platform's built-in programming tools. All communication occurs over an authenticated peer-to-peer network, ensuring that no third parties can access the exchange. The platform also provides a unified interface for humans, accessible via both mobile devices and laptops, which was a key component of the experience reported in this paper. Results of our experimentation involving 17 human participants and 19 LLMs revealed that current models are still sometimes mistaken for humans. Interestingly, there are several unexpected mistakes, suggesting that human fingerprints are still identifiable but not fully unambiguous, despite the high-quality language skills of artificial participants. We argue that this is the first experiment conducted in such a distributed setting, and that similar initiatives could be of national interest to support ongoing experiments and competitions aimed at monitoring the evolution of large language models over time.

Paper Structure (21 sections, 17 figures, 1 table)

Figures (17)

  • Figure 1: TuringHotel web interface on a mobile device. It features a simple chat interface, where messages from the other participants and the RoomManager are presented in a conversational format.
  • Figure 2: Structure of the UNaIVERSE World in TuringHotel. Multiple LLMs are instructed to instantiate artificial agents that join the hotel. Humans join through the UNaIVERSE web interface (https://unaiverse.io) from mobile or desktop devices. Participants are distributed over the internet. A manager agent handles the hotel, randomly (and anonymously) assigning guests to rooms. After 3 minutes of conversation in rooms of 4 agents each (the picture is a generic representation and shows only 3 participants), the manager asks all participants (humans and AIs) which participants in the conversation were human. Then it is back to the hall, and it all starts over!
  • Figure 3: Overall accuracy, precision, and recall (Humans vs. AI examiners). Aggregate classification performance for human evaluators and AI evaluators when labeling participants as AI vs. human. Metrics are reported as accuracy, precision, and recall (with “AI” treated as the positive class); values above bars show the overall means.
  • Figure 4: Accuracy by LLM (Humans vs. AIs). Identification accuracy when distinguishing artificial from human participants in TuringHotel discussions, broken down by the LLM used to generate the artificial agents. Blue bars show human accuracy (humans detecting AIs) and red bars show AI examiner accuracy (AIs detecting AIs); values above bars report mean accuracy per model.
  • Figure 5: Message length statistics for humans and AI agents. (Left) Mean words per message for humans versus all AI agents pooled. (Middle) Mean words per message for each LLM family. (Right) Distributions of words per message for humans and AI agents (overall and per LLM), highlighting overlap at short lengths and heavier tails for AI outputs. Bars indicate means; error bars indicate dispersion in message length.
  • ...and 12 more figures
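
The Figure 3 caption reports accuracy, precision, and recall with "AI" treated as the positive class. As a rough illustration of what that convention means, the sketch below computes the three metrics from a list of examiner judgments. The `Judgment` record, the `classification_metrics` function, and the toy data are hypothetical and introduced only for this example; they are not taken from the paper or from the UNaIVERSE platform, and the numbers are not experimental results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One examiner's guess about one participant (hypothetical record)."""
    true_is_ai: bool       # ground truth: the judged participant is an AI
    predicted_is_ai: bool  # the examiner labeled the participant as an AI


def classification_metrics(judgments):
    """Accuracy, precision, and recall with "AI" as the positive class."""
    tp = sum(j.true_is_ai and j.predicted_is_ai for j in judgments)
    tn = sum((not j.true_is_ai) and (not j.predicted_is_ai) for j in judgments)
    fp = sum((not j.true_is_ai) and j.predicted_is_ai for j in judgments)
    fn = sum(j.true_is_ai and (not j.predicted_is_ai) for j in judgments)

    total = tp + tn + fp + fn
    nan = float("nan")  # returned when a metric is undefined
    return {
        # Fraction of all guesses that were correct.
        "accuracy": (tp + tn) / total if total else nan,
        # Of the participants labeled "AI", how many really were AIs.
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        # Of the real AIs, how many were labeled "AI".
        "recall": tp / (tp + fn) if (tp + fn) else nan,
    }


# Toy data, made up for illustration only.
toy_judgments = [
    Judgment(true_is_ai=True, predicted_is_ai=True),
    Judgment(true_is_ai=True, predicted_is_ai=True),
    Judgment(true_is_ai=True, predicted_is_ai=False),   # AI mistaken for a human
    Judgment(true_is_ai=False, predicted_is_ai=True),   # human mistaken for an AI
    Judgment(true_is_ai=False, predicted_is_ai=False),
]

if __name__ == "__main__":
    print(classification_metrics(toy_judgments))
```

Under this convention, an AI that passes as human lowers recall, while a human wrongly flagged as an AI lowers precision, so the two metrics separate the two kinds of mistakes that a single accuracy figure would merge.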