WHODUNIT: Evaluation benchmark for culprit detection in mystery stories

Kshitij Gupta

TL;DR

The paper introduces WhoDunIt, a dataset for evaluating the deductive reasoning of large language models on mystery narratives. Character names are augmented with a range of substitutions to test whether models rely on memorized associations. The paper benchmarks GPT-4o, GPT-4-turbo, and GPT-4o-mini under four prompting styles (Basic, Chain-of-Thought, Self-Reflection, and their combination), taking the majority response over 10 trials per prompt. The larger models perform strongly on unaltered text; accuracy drops under certain name substitutions and improves with prompts that structure reasoning and validation. Longer documents cost the top models little accuracy but do degrade the smallest one. The study shows that prompting strategy and data preparation substantially affect narrative deduction, and it points to future work on longer-form puzzles and bias mitigation. The dataset is publicly released for ongoing evaluation.
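The majority-vote protocol mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the protocol means repeated trials of the same prompt whose answers are pooled, and the helper name `majority_vote` and the normalization step (trimming and lowercasing) are hypothetical choices made for this example.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer across repeated trials.

    Answers are normalized (trimmed, lowercased) before counting, which is
    an assumption of this sketch. Ties go to the answer seen first.
    Returns None when no non-empty answer is present.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return None
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical culprit names returned by five trials of the same prompt.
trial_answers = ["Alice", "alice ", "Bob", "Alice", "Bob"]
winner = majority_vote(trial_answers)
```

Pooling repeated trials this way reduces the effect of sampling noise in any single response, at the price of one model call per trial.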

Abstract

We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, running multiple trials per story and selecting the majority response to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.
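The name augmentations described in the abstract (swaps and substitutions) could be implemented along the following lines. This is a sketch under stated assumptions rather than the dataset's actual pipeline: the functions `substitute_names` and `swap_names` are hypothetical, matching is done on whole words with a regular expression, and all names in the example are placeholders.

```python
import re


def substitute_names(text, mapping):
    """Replace whole-word occurrences of each key in `mapping` with its value.

    All replacements happen in a single pass, so a swap such as
    {"Alice": "Bob", "Bob": "Alice"} does not overwrite itself.
    """
    if not mapping:
        return text
    # Try longer names first so a full name wins over a contained first name.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def swap_names(text, first, second):
    """Exchange two character names throughout the text."""
    return substitute_names(text, {first: second, second: first})


story = "Alice accused Bob. Bobby said nothing."
swapped = swap_names(story, "Alice", "Bob")
famous = substitute_names(story, {"Alice": "Sherlock Holmes"})
```

The single-pass design matters for swaps: two sequential `str.replace` calls would turn every name into the same one. Word boundaries keep "Bobby" intact when "Bob" is replaced; names ending in punctuation, such as titles with a trailing period, would need extra handling.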

Paper Structure

This paper contains 15 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Distribution by Length
  • Figure 2: Accuracy comparison across models
  • Figure 3: Accuracy distribution across the number of pages for different models.
  • Figure 4: Accuracy across different data augmentation techniques.
  • Figure 5: Accuracy across different prompting techniques.