Medal Matters: Probing LLMs' Failure Cases Through Olympic Rankings

Juhwan Choi, Seunguk Yu, JungMin Yun, YoungBin Kim

TL;DR

This work probes whether LLMs organize knowledge similarly to humans by using Olympic medal tallies to compare Medal QA (retrieving exact medal counts) with Team QA (inferring rankings). In a closed-book evaluation of 12 models on 650 teams across 34 Games, the study finds that models excel at recalling counts but falter at deriving rankings, revealing a gap in internal knowledge integration. It also uncovers vulnerability to doubt prompts: a doubt matrix shows a measurable decline in accuracy once the user questions an answer, suggesting that self-correction without evidence is unstable. The findings highlight fundamental differences between human-like reasoning and next-token prediction, and point to future directions such as graph-based pretraining to improve information linking and robustness. The authors also release code, data, and outputs to foster further research.
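
The doubt matrix mentioned above can be pictured as a small contingency table over answers given before and after the doubt prompt. The sketch below is only an illustration under stated assumptions: this summary does not define the matrix layout, so the 2x2 form (initially correct or not, finally correct or not), the function names, and the toy records are all hypothetical rather than the authors' implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# One record per question: (reference answer, answer before doubt, answer after doubt).
Record = Tuple[int, int, int]
Cell = Tuple[bool, bool]  # (initially correct?, finally correct?)


def doubt_matrix(records: Iterable[Record]) -> Dict[Cell, int]:
    """Count how answers move between correct and incorrect after a doubt prompt.

    Assumed layout (not taken from the paper): a 2x2 table keyed by
    (initial correctness, final correctness).
    """
    counts: Counter = Counter()
    for reference, initial, final in records:
        counts[(initial == reference, final == reference)] += 1
    # Materialize all four cells so that missing transitions show up as 0.
    return {(i, f): counts[(i, f)] for i in (True, False) for f in (True, False)}


def accuracies(matrix: Dict[Cell, int]) -> Tuple[float, float]:
    """Return (initial accuracy, final accuracy) implied by the matrix."""
    total = sum(matrix.values())
    if total == 0:
        raise ValueError("no records to evaluate")
    initial_acc = (matrix[(True, True)] + matrix[(True, False)]) / total
    final_acc = (matrix[(True, True)] + matrix[(False, True)]) / total
    return initial_acc, final_acc


# Hypothetical toy records, not results from the paper.
toy_records = [(10, 10, 10), (3, 3, 4), (7, 7, 7), (2, 5, 5), (0, 1, 1)]
matrix = doubt_matrix(toy_records)
initial_acc, final_acc = accuracies(matrix)
```

In this toy data one answer flips from correct to incorrect after the doubt prompt and none flips the other way, so the final accuracy is lower than the initial accuracy. That is the kind of decline the TL;DR describes.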

Abstract

Large language models (LLMs) have achieved remarkable success in natural language processing tasks, yet their internal knowledge structures remain poorly understood. This study examines these structures through the lens of historical Olympic medal tallies, evaluating LLMs on two tasks: (1) retrieving medal counts for specific teams and (2) identifying rankings of each team. While state-of-the-art LLMs excel in recalling medal counts, they struggle with providing rankings, highlighting a key difference between their knowledge organization and human reasoning. These findings shed light on the limitations of LLMs' internal knowledge integration and suggest directions for improvement. To facilitate further research, we release our code, dataset, and model outputs.
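
To make the relationship between the two tasks concrete: a team's ranking is, in principle, fully determined by the medal counts that the first task asks for. The following minimal sketch assumes the common gold-first ordering (gold, then silver, then bronze) with tied teams sharing a rank. The exact ranking convention is not stated in this summary, and the team names and tallies are hypothetical placeholders.

```python
from typing import Dict, Optional, Tuple

Tally = Tuple[int, int, int]  # (gold, silver, bronze)


def rank_teams(tallies: Dict[str, Tally]) -> Dict[str, int]:
    """Derive ranks from medal tallies.

    Assumes gold-first lexicographic ordering and competition-style ties
    (tied teams share a rank and the next rank is skipped).
    """
    # Tuples compare element by element, so sorting on the tally itself
    # orders by gold, then silver, then bronze.
    ordered = sorted(tallies.items(), key=lambda item: item[1], reverse=True)
    ranks: Dict[str, int] = {}
    prev_tally: Optional[Tally] = None
    prev_rank = 0
    for position, (team, tally) in enumerate(ordered, start=1):
        rank = prev_rank if tally == prev_tally else position
        ranks[team] = rank
        prev_tally, prev_rank = tally, rank
    return ranks


# Hypothetical toy tallies, not real Olympic data.
toy_tallies = {
    "A": (10, 5, 2),
    "B": (10, 7, 1),
    "C": (3, 9, 9),
    "D": (3, 9, 9),
    "E": (0, 1, 4),
}
ranks = rank_teams(toy_tallies)
```

Here teams C and D tie on all three counts and share rank 3, so E is ranked 5. A model that recalls every tally correctly has all the information this function needs, which is why a gap between count recall and ranking accuracy points to an integration problem rather than missing facts.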

Paper Structure

This paper contains 19 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Main experimental results. The squares and diamonds represent the accuracy on the medal QA task before and after receiving doubtful user feedback, respectively, specifically for questions about gold medals. The triangles represent the initial and final accuracy on the team QA task. The results suggest a significant performance gap between the two tasks, as well as a decrease in performance after receiving doubtful feedback. Detailed results are provided in Table \ref{tab:full} in Appendix \ref{app:detailed-experimental}.
  • Figure 2: Doubt matrix for Claude-3.5-Sonnet on the medal QA task, specifically for predicting the number of gold medals. The matrix shows how the model's responses changed after the user expressed doubt. Doubt matrices of other models are presented in Appendix \ref{app:detailed-doubt}.