Do LLMs and Humans Find the Same Questions Difficult? A Case Study on Japanese Quiz Answering
Naoya Sugiura, Kosuke Yamada, Yasuhiro Ogawa, Katsuhiko Toyama, Ryohei Sasano
TL;DR
The paper investigates whether quiz difficulty as perceived by humans aligns with difficulty for large language models in a Japanese buzzer-quiz context. It collects Japanese quiz data with human correct response rates and evaluates multiple LLMs (GPT-4o, Swallow 70B, Sarashina2 70B) under full and partial question inputs, analyzing the results from two perspectives: whether the answer has a Wikipedia entry, and the character type of the answer. Key findings show that LLMs rely heavily on Wikipedia-like encyclopedic knowledge and struggle with questions that require numerical answers, while humans are less sensitive to Wikipedia coverage; katakana answers are comparatively easier for LLMs. The study highlights distinct error patterns between LLMs and humans, informing deployment and evaluation of quiz-answering systems and suggesting directions for cross-language and modality-specific analyses.
Abstract
LLMs have achieved performance that surpasses humans in many NLP tasks. However, it remains unclear whether problems that are difficult for humans are also difficult for LLMs. This study investigates how the difficulty of quizzes in a buzzer setting differs between LLMs and humans. Specifically, we first collect Japanese quiz data including questions, answers, and human correct response rates, then prompt LLMs to answer the quizzes under several settings, and compare their correct response rates to those of humans from two analytical perspectives. The experimental results show that, compared to humans, LLMs struggle more with quizzes whose correct answers are not covered by Wikipedia entries, and also have difficulty with questions that require numerical answers.
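
The comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`char_type`, `answer_char_type`, `rate_gap_by_group`), the record fields (`answer`, `human_rate`, `llm_correct`), and the character-type rule based on Unicode names are all assumptions made for illustration, and the records in the usage example are made-up toy values rather than data from the paper.

```python
import unicodedata
from collections import defaultdict


def char_type(ch: str) -> str:
    """Label one character by script, using its Unicode name (an assumed rule)."""
    if ch.isdigit():
        return "numeral"
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    return "other"


def answer_char_type(answer: str) -> str:
    """Return the single character type of an answer, or 'mixed' if there are several."""
    types = {char_type(ch) for ch in answer if not ch.isspace()}
    return types.pop() if len(types) == 1 else "mixed"


def rate_gap_by_group(records):
    """Per character type: (mean human rate, LLM accuracy, LLM minus human)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # human total, LLM total, count
    for r in records:
        s = sums[answer_char_type(r["answer"])]
        s[0] += r["human_rate"]
        s[1] += 1.0 if r["llm_correct"] else 0.0
        s[2] += 1
    return {g: (h / n, l / n, l / n - h / n) for g, (h, l, n) in sums.items()}


# Hypothetical toy records, not taken from the paper's dataset.
toy_records = [
    {"answer": "アインシュタイン", "human_rate": 0.5, "llm_correct": True},
    {"answer": "1868", "human_rate": 0.5, "llm_correct": False},
    {"answer": "ピアノ", "human_rate": 0.25, "llm_correct": True},
]

for group, (human, llm, gap) in rate_gap_by_group(toy_records).items():
    print(f"{group}: human={human:.3f} llm={llm:.3f} gap={gap:+.3f}")
```

A negative gap for a group would indicate that the LLM does worse than humans on that answer type. The same grouping could be applied to a Wikipedia-coverage flag instead of character type, assuming such a flag is available per question.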
