Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation
SeongYeub Chu, JongWoo Kim, MunYong Yi
TL;DR
InteractEval combines human and LLM Think-Aloud (TA) outputs to build fine-grained checklists for evaluating text along four dimensions (Coherence, Fluency, Consistency, Relevance). The framework runs a three-stage pipeline (TA attribute collection, component-based checklist construction, and evaluator LLM scoring) and achieves higher correlation with ground-truth human scores than the compared baselines on SummEval; it also generalizes to ELLIPSE, with Comb-TA (human+LLM) delivering the strongest results. The findings show that humans excel at internal quality (structure, readability) while LLMs excel at external alignment (consistency with source, relevance), and that integrating the two improves both performance and attribute diversity. Practically, InteractEval offers a scalable, cost-effective method for robust evaluation of generated text and can inform future human–AI collaboration in quality assessment tasks.
Abstract
This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four dimensions: Coherence, Fluency, Consistency, and Relevance. The experiment also investigates the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, which leads to a wider range of relevant attributes and enhances text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), whereas LLMs perform better at identifying attributes related to external alignment (Consistency and Relevance). Consequently, leveraging humans and LLMs together produces the best evaluation outcomes. In other words, this study emphasizes the necessity of effectively combining humans and LLMs in an automated checklist-based text evaluation framework. The code is available at https://github.com/BBeeChu/InteractEval.git.
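
To make the checklist idea concrete, the following is a minimal sketch of how human and LLM attributes might be merged, turned into yes/no questions, and used to score a text. It is not the authors' implementation: the function names, the question template, the 1-to-5 scale mapping, and the `toy_judge` stand-in for an evaluator LLM are all assumptions made for illustration.

```python
from typing import Callable, List


def combine_attributes(human: List[str], llm: List[str]) -> List[str]:
    """Merge human and LLM attributes, dropping case-insensitive duplicates."""
    seen, merged = set(), []
    for attr in human + llm:
        key = attr.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(attr.strip())
    return merged


def build_checklist(attributes: List[str]) -> List[str]:
    """Turn each attribute into a yes/no question (hypothetical template)."""
    return [f"Does the text satisfy this criterion: {a}?" for a in attributes]


def score_text(
    text: str,
    checklist: List[str],
    judge: Callable[[str, str], bool],
    low: float = 1.0,
    high: float = 5.0,
) -> float:
    """Map the fraction of 'yes' answers onto the [low, high] scale (assumed mapping)."""
    if not checklist:
        raise ValueError("checklist must not be empty")
    yes = sum(1 for question in checklist if judge(text, question))
    return low + (high - low) * yes / len(checklist)


def toy_judge(text: str, question: str) -> bool:
    # Hypothetical stand-in for an evaluator LLM call: answers "no" only to
    # the source-consistency question and "yes" to everything else.
    return "source" not in question.lower()


# Illustrative attributes; these are made up, not taken from the paper's data.
human_attrs = ["Sentences follow a logical order", "No grammatical errors"]
llm_attrs = ["no grammatical errors", "Facts match the source document"]

merged = combine_attributes(human_attrs, llm_attrs)
checklist = build_checklist(merged)
score = score_text("An example summary.", checklist, toy_judge)
print(merged)
print(round(score, 2))
```

In this sketch the duplicated grammar attribute is kept once, so the combined checklist has three questions, and the toy judge passes two of them. In a real setup `toy_judge` would be replaced by a call to an evaluator LLM, and the resulting scores would be compared against human ratings with a rank correlation.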
