THiNK: Can Large Language Models Think-aloud?

Yongan Yu, Mengqian Wu, Yiran Lin, Nikki G. Lobczowski

TL;DR

THiNK addresses the challenge of measuring higher-order thinking (HOT) in LLMs by turning reasoning into an iterative think-aloud process for refining math word problems. It employs a multi-agent evaluation anchored in Bloom's taxonomy to generate, critique, and revise items, guided by structured feedback. The evaluation defines a composite quality score $Q(p_i)=0.5\,PR(p_i)+0.3\,AA(p_i)+0.2\,AC(p_i)$ with a success threshold $Q(p_i)>85$, and uses a holistic agent to provide targeted improvements. Across seven LLMs, THiNK reveals a gap in HOT skills that feedback loops help mitigate, and the qualitative analysis shows that THiNK-guided outputs align better with domain logic. The work provides a scalable methodology and open-source code for probing and advancing LLM reasoning in education.
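The composite score above is a plain weighted sum, which a few lines of Python can make concrete. The sketch below is illustrative only: the function names are hypothetical, and it assumes the three components $PR$, $AA$, and $AC$ are each on a 0 to 100 scale, as the threshold of 85 suggests. This summary does not say how each component is computed, so they are taken as given inputs.

```python
def quality_score(pr: float, aa: float, ac: float) -> float:
    """Weighted sum Q = 0.5*PR + 0.3*AA + 0.2*AC, as stated in the TL;DR.

    Assumption: pr, aa, and ac are already on a 0-100 scale.
    """
    return 0.5 * pr + 0.3 * aa + 0.2 * ac


def passes(q: float, threshold: float = 85.0) -> bool:
    """Success criterion Q > 85 (strict inequality, per the TL;DR)."""
    return q > threshold


# Hypothetical component values, chosen only for illustration.
q = quality_score(pr=90.0, aa=80.0, ac=70.0)
print(round(q, 2), passes(q))
```

Because $PR$ carries half the weight, a problem with weak $PR$ cannot clear the threshold on the other two components alone: with $PR = 60$, the maximum attainable score is $0.5 \cdot 60 + 0.3 \cdot 100 + 0.2 \cdot 100 = 80$.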

Abstract

Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models perform reliably on lower-order categories, they struggle to apply knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs better align with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning, offering new directions for evaluation grounded in learning science. The code is available at our GitHub repository.

Paper Structure

This paper contains 38 sections, 9 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: The "think-aloud" process. Iterative revision and reflection allow a more robust assessment of higher-order thinking (HOT) skills in LLMs.
  • Figure 2: Overview of the THiNK framework. The pipeline begins with flawed math problems that are iteratively refined. The core multi-agent evaluation stage uses six Bloom-aligned agents and one heuristic agent to assess quality, providing scores and targeted feedback. Guided by the "Five Keys" and prior suggestions, LLMs revise or generate new problems via a think-aloud process. A quality threshold determines success or triggers further refinement.
  • Figure 3: Comparison between higher-order thinking (HOT) and lower-order thinking (LOT). The scale is the sum of scores across the corresponding levels.
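
The loop described in the Figure 2 caption (evaluate, compare against the quality threshold, revise with feedback, repeat) can be sketched roughly as follows. This is a minimal sketch and not the authors' implementation: `refine`, `evaluate`, `revise`, and the `max_revisions` cap are hypothetical names and assumptions, and the stub functions merely stand in for the multi-agent scoring stage and the LLM revision step. Only the threshold of 85 comes from the source.

```python
from typing import Callable, Tuple

# evaluate(problem) -> (quality score, feedback text); hypothetical interface.
Evaluate = Callable[[str], Tuple[float, str]]
# revise(problem, feedback) -> revised problem; hypothetical interface.
Revise = Callable[[str, str], str]


def refine(
    problem: str,
    evaluate: Evaluate,
    revise: Revise,
    threshold: float = 85.0,   # success criterion Q > 85 from the source
    max_revisions: int = 3,    # assumed cap; not stated in this summary
) -> Tuple[str, float, int]:
    """Revise a problem until its score exceeds the threshold or the cap is hit.

    Returns the final problem, its score, and the number of revisions made.
    """
    revisions = 0
    q, feedback = evaluate(problem)
    while q <= threshold and revisions < max_revisions:
        problem = revise(problem, feedback)
        revisions += 1
        q, feedback = evaluate(problem)
    return problem, q, revisions


# Stubs for illustration only: each revision marker adds 15 points.
def stub_evaluate(problem: str) -> Tuple[float, str]:
    return 60.0 + 15.0 * problem.count("[rev]"), "tighten the wording"


def stub_revise(problem: str, feedback: str) -> str:
    return problem + " [rev]"


print(refine("x", stub_evaluate, stub_revise))
```

Passing the evaluator and reviser in as callables keeps the control flow separate from any particular model or agent setup, so the same loop can be exercised with stubs like these before real components are plugged in.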