The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles

Abhinav P M, Ojasva Saxena, Oswald C, Parameswari Krishnamurthy

TL;DR

This study evaluates reasoning and self-awareness of large language models across seven Indian languages using a multilingual riddle dataset. It introduces context-reconstructed riddles and a three-phase methodology: model rating, dataset generation, and two-stage evaluation (riddle-solving and self-evaluation) across seven prompting strategies. Results show Gemini 2.5 Pro as the strongest solver, yet overconfident in its answers, while weaker models demonstrate greater self-awareness; performance varies significantly by language and task. The findings highlight persistent gaps in culturally grounded multilingual reasoning and underscore the need for reflective evaluation frameworks and approaches to enhance self-awareness in multilingual LLMs.

Abstract

The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
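
The True Negative Rate quoted in the abstract can be read as the share of a model's initially wrong answers that the model itself flags as wrong during self-evaluation. The sketch below illustrates that reading only; the `SelfEvalRecord` fields and the `true_negative_rate` helper are hypothetical names introduced for illustration, and the paper's exact scoring procedure may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SelfEvalRecord:
    # Hypothetical record layout, assumed for this sketch.
    answer_correct: bool  # was the model's initial riddle answer correct?
    judged_correct: bool  # did the model later judge its own answer as correct?


def true_negative_rate(records: List[SelfEvalRecord]) -> Optional[float]:
    """Fraction of initially wrong answers that the model flags as wrong."""
    wrong = [r for r in records if not r.answer_correct]
    if not wrong:
        return None  # undefined when the model made no mistakes
    flagged = sum(1 for r in wrong if not r.judged_correct)
    return flagged / len(wrong)


# Toy data: four wrong answers, only one of which the model recognises as wrong.
records = [
    SelfEvalRecord(answer_correct=False, judged_correct=False),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
    SelfEvalRecord(answer_correct=False, judged_correct=True),
    SelfEvalRecord(answer_correct=True, judged_correct=True),
]
print(true_negative_rate(records))  # 0.25
```

Under this reading, a low value such as 4.34% means the model endorses almost all of its own mistakes, while a higher value such as 42.09% means it catches a sizeable share of them.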

Paper Structure

This paper contains 21 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Overview of the comprehensive three-phase methodology, including LLM rating, riddle generation with validation, and a two-stage evaluation of riddle-solving and model self-awareness.
  • Figure 2: An original Hindi riddle (top) and two of its context-reconstructed variants. Note how the core reasoning pattern is maintained while the theme and answer are altered.
  • Figure 3: LLM Performance in Reconstructed Contextual Riddle Generation Across Indian Languages.
  • Figure 4: Average riddle-solving accuracy (%) and BERTScore F1 across seven prompting strategies for five LLMs. The plot illustrates the minor variation in model performance based on the prompting method.
  • Figure 5: Average riddle-solving accuracy (%) and BERTScore F1 across seven Indian languages for five LLMs. The plot illustrates the variation in model performance based on the target language.
  • ...and 1 more figure