LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles

Shulin Huang, Shirong Ma, Yinghui Li, Mengzuo Huang, Wuhe Zou, Weidong Zhang, Hai-Tao Zheng

TL;DR

This work introduces LatEval, an interactive benchmark that assesses LLMs' lateral thinking through a host-player Lateral Thinking Puzzle framework. It provides a 325-entry bilingual (English-Chinese) dataset with annotated clues, and defines metrics (AC, QR, QD, AT) to quantify question quality, information integration, and deduction accuracy. Experimental results show that most models struggle with true lateral thinking: GPT-4 leads but still falls below human performance, which points to a meaningful evaluation gap. The dataset and code aim to drive development of AI assistants capable of divergent questioning and robust information synthesis in interactive tasks.

Abstract

With the continuous evolution and refinement of LLMs, they are endowed with impressive logical reasoning, or vertical thinking, capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses the model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs on two aspects: the quality of the questions posed by the model and the model's capability to integrate information for problem-solving. We find that nearly all LLMs struggle to employ lateral thinking during interactions. For example, even the most advanced model, GPT-4, exhibits an advantage to some extent, yet still maintains a noticeable gap compared to humans. This evaluation benchmark provides LLMs with a highly challenging and distinctive task that is crucial to an effective AI assistant.

Paper Structure (17 sections, 3 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The Comparison of Vertical Thinking and Lateral Thinking. Vertical Thinking typically refers to thinking within established or conventional thought patterns, following the known rules. Lateral Thinking involves breaking out of traditional thought patterns and employing innovative approaches to explore non-conventional solutions.
  • Figure 2: An example of a Lateral Thinking Puzzle in our benchmark, including several turns of interaction between the host and the player, and the automatic evaluation for posed questions and player's answer.
  • Figure 3: Lateral thinking performance of various LLMs under various difficulty settings: providing 0% Clues, 50% Clues and 100% Clues. We report two metrics: ROUGE and Average Turns.
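The host-player interaction described in Figure 2 can be sketched roughly as a turn loop. This is only an illustrative sketch, not the authors' implementation: the names `play_puzzle`, `GameResult`, and `average_turns`, the turn cap, the "yes"/"irrelevant" reply vocabulary, and the toy questions are all assumptions made for this example. In particular, `average_turns` is a plain mean of question counts and may differ from how the paper defines its Average Turns metric.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (player question, host reply) pairs


@dataclass
class GameResult:
    turns: int            # number of questions the player asked
    transcript: Transcript
    final_answer: str     # the player's proposed explanation of the puzzle


def play_puzzle(
    ask: Callable[[Transcript], Optional[str]],
    answer: Callable[[Transcript], str],
    host_reply: Callable[[str], str],
    max_turns: int = 10,  # hypothetical cap, not taken from the paper
) -> GameResult:
    """Run one game: the player asks, the host replies, the player answers."""
    transcript: Transcript = []
    for _ in range(max_turns):
        question = ask(transcript)
        if question is None:  # player decides it has enough information
            break
        transcript.append((question, host_reply(question)))
    return GameResult(len(transcript), transcript, answer(transcript))


def average_turns(results: List[GameResult]) -> float:
    """Mean number of questions per game (assumed definition)."""
    return sum(r.turns for r in results) / len(results)


# Toy stand-ins for the player model and the host (made-up content).
scripted_questions = ["Was the man alone?", "Was it an accident?"]


def ask(transcript: Transcript) -> Optional[str]:
    i = len(transcript)
    return scripted_questions[i] if i < len(scripted_questions) else None


def host_reply(question: str) -> str:
    return {"Was the man alone?": "yes"}.get(question, "irrelevant")


def answer(transcript: Transcript) -> str:
    return "placeholder answer"


result = play_puzzle(ask, answer, host_reply)
print(result.turns, average_turns([result]))
```

In a real evaluation, `ask` and `answer` would wrap calls to the model under test and `host_reply` would wrap the host; passing them in as callables keeps the loop independent of any particular model API.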