Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar
TL;DR
Final-task accuracy overlooks the quality of intermediate reasoning in multi-agent IR systems. This work introduces a Thinker-Executor framework that decouples CoT generation from execution, and proposes two metrics, reusability and verifiability, to evaluate the utility and clarity of a CoT across a committee of Executors. Experiments with four Thinkers and ten Executors on five benchmarks show that reusability and verifiability do not consistently track accuracy, and that general-purpose LLMs can outperform specialized reasoning models on these metrics. These findings suggest that accuracy-based leaderboards inadequately capture reasoning capability, and they motivate integrating interaction-based metrics into evaluation and training to improve robustness in collaborative AI systems. The work also provides a publicly available pipeline and prompts to support reproducibility and broader adoption of cross-model CoT evaluation. The core metrics are formalized as $R(M_T,Q,M_E) = \frac{(|Q_{helped}| + |Q_{harmed}|) \times 100}{|Q_{correct}|}$ for reusability and $V = \frac{100}{|Q|} \sum_{q \in Q} \mathbb{I}(Ans(M_E,q, CoT_{M_T,q}) = Ans(M_T,q, CoT_{M_T,q}))$ for verifiability, where $M_T$ is the Thinker, $M_E$ is an Executor, and $Q$ is the question set.
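As a minimal sketch, the two formulas above can be evaluated directly in Python. The function and parameter names are illustrative assumptions and are not taken from the authors' pipeline. This section does not define how $Q_{helped}$, $Q_{harmed}$, and $Q_{correct}$ are constructed, so the reusability function takes their sizes as inputs rather than guessing at their definitions.

```python
from typing import Sequence


def reusability(n_helped: int, n_harmed: int, n_correct: int) -> float:
    """R = (|Q_helped| + |Q_harmed|) * 100 / |Q_correct|, as written in the TL;DR.

    The three set sizes are supplied by the caller; how the sets are built
    is not specified in this section.
    """
    if n_correct <= 0:
        raise ValueError("n_correct must be positive")
    return (n_helped + n_harmed) * 100 / n_correct


def verifiability(
    executor_answers: Sequence[str], thinker_answers: Sequence[str]
) -> float:
    """V = (100 / |Q|) * sum over q of I(executor answer == thinker answer).

    Both sequences are assumed to be aligned by question, and the Executor
    answers are assumed to have been produced with the Thinker's CoT.
    """
    if len(executor_answers) != len(thinker_answers):
        raise ValueError("answer lists must have the same length")
    if not thinker_answers:
        raise ValueError("question set must be non-empty")
    matches = sum(e == t for e, t in zip(executor_answers, thinker_answers))
    return 100 * matches / len(thinker_answers)
```

Note that verifiability compares the Executor's answer with the Thinker's answer, not with the gold label, so a CoT can be fully verifiable even when the Thinker is wrong.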
Abstract
In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in the form of Chain-of-Thought (CoT). Current CoT evaluation focuses narrowly on target-task accuracy, a metric that fails to assess the quality or utility of the reasoning process itself. To address this limitation, we introduce two novel measures: reusability and verifiability. We decouple CoT generation from execution using a Thinker-Executor framework. Reusability measures how easily an Executor can reuse the Thinker's CoT. Verifiability measures how frequently an Executor can match the Thinker's answer using the CoT. We evaluate four Thinker models against a committee of ten Executor models across five benchmarks. Our results reveal that reusability and verifiability do not correlate with standard accuracy, exposing a blind spot in current accuracy-based leaderboards for reasoning capability. Surprisingly, we find that CoTs from specialized reasoning models are not consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
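The decoupling described above can be sketched as a small loop in which a Thinker produces a CoT and an answer once per question, and each Executor in the committee then answers the same question given only that CoT. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the `Thinker` and `Executor` callable signatures and the `run_committee` helper are hypothetical, and answers are compared by exact string equality.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   Thinker:  question -> (chain_of_thought, answer)
#   Executor: (question, chain_of_thought) -> answer
Thinker = Callable[[str], Tuple[str, str]]
Executor = Callable[[str, str], str]


def run_committee(
    thinker: Thinker,
    executors: Dict[str, Executor],
    questions: List[str],
) -> Dict[str, float]:
    """Return, per Executor, the percentage of questions on which its answer
    (given the Thinker's CoT) matches the Thinker's own answer."""
    if not questions:
        raise ValueError("questions must be non-empty")

    # Generation step: the Thinker runs once per question.
    traces = [thinker(q) for q in questions]

    # Execution step: every Executor reuses the same CoTs.
    scores: Dict[str, float] = {}
    for name, executor in executors.items():
        matches = sum(
            executor(q, cot) == answer
            for q, (cot, answer) in zip(questions, traces)
        )
        scores[name] = 100 * matches / len(questions)
    return scores
```

Generating each CoT once and sharing it across the committee keeps the Executors comparable, since any difference in their scores comes from how they use the same reasoning trace.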
