Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

Qi Chen, Bowen Zhang, Gang Wang, Qi Wu

TL;DR

This paper introduces SPLAT, a benchmark that uses Situation Puzzles to evaluate and elicit lateral thinking in large language models. Its multi-turn player-judge framework reduces reliance on a stronger evaluation model while still allowing open-ended assessment of creative reasoning. The authors report that a judge model such as WizardLM-2 agrees closely with human judgments, and that data and reasoning prompts derived from SPLAT transfer to other lateral thinking benchmarks, such as RiddleSense and BrainTeaser, and improve performance there.

Abstract

While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates whether the player's predictions align with the reference scenario. This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs. The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement, similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs. Code is available at: https://github.com/chenqi008/LateralThinking.
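The player-judge game described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the names `Puzzle`, `GameResult`, `play`, `toy_player`, and `toy_judge`, the `max_rounds` limit, the `"Guess:"` prefix, and the reply strings are all assumptions made for the example. The toy puzzle is a generic illustration and is not taken from SPLAT. In practice the player and judge would be LLM calls; here they are scripted stubs so the loop can run on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Puzzle:
    story: str      # incomplete story shown to the player
    reference: str  # full scenario, visible only to the judge


@dataclass
class GameResult:
    solved: bool
    rounds: int
    transcript: List[Tuple[str, str]] = field(default_factory=list)


def play(
    puzzle: Puzzle,
    player: Callable[[str, List[Tuple[str, str]]], str],
    judge: Callable[[str, str], str],
    max_rounds: int = 10,  # assumed cap; not specified in the source
) -> GameResult:
    """Run one game: the player sees only the story and the dialogue so far,
    while the judge answers using the hidden reference scenario."""
    transcript: List[Tuple[str, str]] = []
    for round_no in range(1, max_rounds + 1):
        message = player(puzzle.story, transcript)
        reply = judge(puzzle.reference, message)
        transcript.append((message, reply))
        if reply == "correct":
            return GameResult(True, round_no, transcript)
    return GameResult(False, max_rounds, transcript)


# Hypothetical stand-ins for the two models.
SCRIPT = [
    "Is the car broken?",
    "Is he playing a game?",
    "Guess: he is playing Monopoly.",
]


def toy_player(story: str, transcript: List[Tuple[str, str]]) -> str:
    # A real player would condition on the story and on earlier judge replies.
    return SCRIPT[min(len(transcript), len(SCRIPT) - 1)]


def toy_judge(reference: str, message: str) -> str:
    # A real judge would compare the message with the reference scenario;
    # this stub uses hard-coded keyword checks instead.
    text = message.lower()
    if text.startswith("guess:"):
        return "correct" if "monopoly" in text else "incorrect"
    return "yes" if "game" in text else "no"


puzzle = Puzzle(
    story="A man pushes his car to a hotel and loses his fortune.",
    reference="He is playing Monopoly and landed on a hotel property.",
)
result = play(puzzle, toy_player, toy_judge)
print(result.solved, result.rounds)
```

Keeping the reference scenario on the judge's side only is the point of the design: the judge needs to compare statements against a known answer rather than solve the puzzle itself, which is why it does not have to be stronger than the player.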

Paper Structure

This paper contains 26 sections, 7 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Examples from the SPLAT benchmark, categorised into three ascending levels of difficulty: Easy, Medium, and Hard. Medium and hard puzzles usually require guidance from the judge to be solved. A puzzle that can be solved without the judge's guidance typically resembles a regular puzzle that requires specific knowledge more than lateral thinking skills.
  • Figure 2: We show (a) the distribution of sample counts across difficulty levels and the average number of tokens per sample, and (b) the distribution of time cost for human players and the number of tokens in each reference answer (200 samples in total) across difficulty levels.
  • Figure 3: Overview of the multi-turn player-judge framework. The player begins with a given story and poses yes/no questions to uncover hidden details. The judge, informed by a reference answer, responds to guide the player toward the correct reasoning. The player's goal is to deduce the scenario based on the judge's feedback and the initial story input. The game continues with questions until the player deduces the correct answer, at which point the judge confirms with a congratulatory response.
  • Figure 4: Performance of various LLMs on RiddleSense (dev set). Llama3 (8B & 70B) and GPT-4 are in the zero-shot setting, while the others are trained on the training sets of RiddleSense and CSQA (Saha et al., 2018). '*' marks models that use our auxiliary reasoning prompts.
  • Figure 5: Instructions to ask humans to write the judgement for the final answer.
  • ...and 1 more figure