The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles
Abhinav P M, Ojasva Saxena, Oswald C, Parameswari Krishnamurthy
TL;DR
This study evaluates the reasoning and self-awareness of large language models across seven Indian languages using a multilingual riddle dataset. It introduces context-reconstructed riddles and a three-phase methodology: model rating, dataset generation, and a two-stage evaluation (riddle-solving and self-evaluation) across seven prompting strategies. Gemini 2.5 Pro is the strongest solver but is overconfident in its answers, while weaker models demonstrate greater self-awareness; performance varies significantly by language and task. The findings highlight persistent gaps in culturally grounded multilingual reasoning and underscore the need for reflective evaluation frameworks and approaches to enhance self-awareness in multilingual LLMs.
Abstract
The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
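To make the True Negative Rate concrete, the following is a minimal Python sketch, assuming the rate is the share of a model's incorrect riddle answers that the model itself marks as incorrect during self-evaluation. The abstract does not spell out the formula, so this definition, the SelfEvalRecord type, and the toy records are illustrative assumptions, not the authors' implementation or data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SelfEvalRecord:
    """One riddle attempt plus the model's later verdict on it (hypothetical schema)."""
    answer_correct: bool  # was the original riddle answer actually right?
    judged_correct: bool  # did the model, on self-evaluation, call its answer right?


def true_negative_rate(records: List[SelfEvalRecord]) -> Optional[float]:
    """Fraction of wrong answers that the model itself flagged as wrong.

    Assumed definition: TN / (TN + FP), where a "negative" is an incorrect
    answer and a "true negative" is an incorrect answer judged incorrect.
    Returns None when there are no incorrect answers to evaluate.
    """
    wrong = [r for r in records if not r.answer_correct]
    if not wrong:
        return None
    flagged = sum(1 for r in wrong if not r.judged_correct)
    return flagged / len(wrong)


# Toy records for illustration only: four wrong answers, one of them flagged.
toy_records = [
    SelfEvalRecord(answer_correct=True, judged_correct=True),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
    SelfEvalRecord(answer_correct=False, judged_correct=False),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
]

toy_tnr = true_negative_rate(toy_records)
print(f"Toy True Negative Rate: {toy_tnr:.2%}")
```

Under this reading, a low rate means the model defends most of its wrong answers (overconfidence), while a higher rate means it catches more of its own mistakes. Correct answers do not enter the calculation, which is why a strong solver can still score poorly on this measure.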
