THiNK: Can Large Language Models Think-aloud?

Yongan Yu; Mengqian Wu; Yiran Lin; Nikki G. Lobczowski

TL;DR

THiNK measures higher-order thinking (HOT) in LLMs by framing reasoning as an iterative think-aloud process for refining math word problems. It uses a multi-agent evaluation anchored in Bloom's taxonomy to drive a cycle of generating, critiquing, and revising problems under structured feedback. Each problem $p_i$ receives a composite quality score $Q(p_i)=0.5\,PR(p_i)+0.3\,AA(p_i)+0.2\,AC(p_i)$ and counts as a success when $Q(p_i)>85$; a holistic agent supplies targeted suggestions for improvement. Across seven LLMs, THiNK reveals a gap in HOT skills that feedback loops reduce, and qualitative analysis shows that THiNK-guided outputs align better with domain logic. The work provides a scalable methodology and open-source code for probing and improving LLM reasoning in education.
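The composite score above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `ComponentScores`, `quality_score`, and `passes` are hypothetical, and the sketch assumes that $PR$, $AA$, and $AC$ are each reported on a 0 to 100 scale, which the threshold of 85 suggests but this summary does not state. The summary also does not expand the three abbreviations, so they are kept as-is.

```python
from dataclasses import dataclass

# Weights and threshold as given in the TL;DR formula.
W_PR, W_AA, W_AC = 0.5, 0.3, 0.2
SUCCESS_THRESHOLD = 85.0


@dataclass
class ComponentScores:
    """Component scores for one problem p_i (assumed to lie in 0..100)."""
    pr: float
    aa: float
    ac: float


def quality_score(s: ComponentScores) -> float:
    """Q(p_i) = 0.5 * PR(p_i) + 0.3 * AA(p_i) + 0.2 * AC(p_i)."""
    return W_PR * s.pr + W_AA * s.aa + W_AC * s.ac


def passes(s: ComponentScores) -> bool:
    """A problem succeeds only when Q(p_i) is strictly greater than 85."""
    return quality_score(s) > SUCCESS_THRESHOLD


if __name__ == "__main__":
    draft = ComponentScores(pr=90.0, aa=80.0, ac=70.0)
    print(round(quality_score(draft), 2), passes(draft))
```

Because $PR$ carries half of the weight, a problem with a low $PR$ cannot clear the threshold on the other two components alone: with $AA = AC = 100$, the problem still needs $PR > 70$.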

Abstract

Assessing higher-order thinking skills in large language models (LLMs) remains a fundamental challenge, especially in tasks that go beyond surface-level accuracy. In this work, we propose THiNK (Testing Higher-order Notion of Knowledge), a multi-agent, feedback-driven evaluation framework grounded in Bloom's Taxonomy. THiNK frames reasoning assessment as an iterative task of problem generation, critique, and revision, encouraging LLMs to think aloud through step-by-step reflection and refinement. This enables a systematic evaluation of both lower-order (e.g., remember, understand) and higher-order (e.g., evaluate, create) thinking skills. We apply THiNK to seven state-of-the-art LLMs and perform a detailed cognitive analysis of their outputs. Results reveal that while models perform reliably on lower-order categories, they struggle to apply knowledge in realistic contexts and exhibit limited abstraction. Structured feedback loops significantly improve reasoning performance, particularly in higher-order thinking. Qualitative evaluations further confirm that THiNK-guided outputs align better with domain logic and problem structure. Our framework provides a scalable methodology for probing and enhancing LLM reasoning and offers new directions for evaluation grounded in learning science. The code is available at our GitHub repository.
