Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation

SeongYeub Chu, JongWoo Kim, MunYong Yi

TL;DR

InteractEval advances text evaluation by combining human and LLM Think-Aloud (TA) outputs into fine-grained checklists for four dimensions (Coherence, Fluency, Consistency, Relevance). Its three-stage pipeline (TA attribute collection, component-based checklist construction, and evaluator-LLM scoring) correlates more strongly with ground-truth human scores on SummEval than the baselines and generalizes to ELLIPSE, with Comb-TA (human+LLM) delivering the strongest results. The findings show that humans excel at internal quality (structure, readability) while LLMs excel at external alignment (consistency with the source, relevance), and that integrating the two improves both performance and attribute diversity. Practically, InteractEval offers a scalable, cost-effective method for robust evaluation of generated text and can inform future human–AI collaboration in quality assessment tasks.

Abstract

This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four distinct dimensions: Coherence, Fluency, Consistency, and Relevance. The experiment also investigates the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, leading to a wider range of relevant attributes and enhanced text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), whereas LLMs perform better at attributes related to external alignment (Consistency and Relevance). Consequently, leveraging both humans and LLMs together produces the best evaluation outcomes. In other words, this study emphasizes the necessity of effectively combining humans and LLMs in an automated checklist-based text evaluation framework. The code is available at https://github.com/BBeeChu/InteractEval.git.
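
As a rough illustration of checklist-based evaluation, the sketch below turns an evaluator's answers to checklist questions into one score per dimension. Everything specific in it is an assumption made for illustration and not the authors' implementation: the yes/no answer format, the linear mapping onto a 1 to 5 scale, the function names `checklist_score` and `score_summary`, and the example answers.

```python
from typing import Dict, List


def checklist_score(answers: List[bool], low: float = 1.0, high: float = 5.0) -> float:
    """Map the share of 'yes' answers onto a [low, high] rating scale.

    Hypothetical aggregation rule: the source only says that checklist
    answers are aggregated into a final score, not how.
    """
    if not answers:
        raise ValueError("checklist has no answers")
    share_yes = sum(answers) / len(answers)
    return low + (high - low) * share_yes


def score_summary(answers_by_dimension: Dict[str, List[bool]]) -> Dict[str, float]:
    """Score one summary on every dimension that has a checklist."""
    return {dim: checklist_score(ans) for dim, ans in answers_by_dimension.items()}


# Made-up evaluator answers for a single summary (True = "yes").
example_answers = {
    "Coherence": [True, True, False, True],
    "Fluency": [True, True, True, True],
    "Consistency": [False, False, True, True],
    "Relevance": [True, False, False, False],
}

scores = score_summary(example_answers)
for dimension, value in scores.items():
    print(f"{dimension}: {value:.1f}")
```

A linear mapping is only one possible choice; weighting questions by component or averaging over several evaluator runs would fit the same interface.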

Paper Structure

This paper contains 46 sections, 1 equation, 12 figures, 24 tables.

Figures (12)

  • Figure 1: InteractEval Framework for Text Evaluation: (A) Think Aloud: Human experts verbalize their thoughts and LLMs articulate their knowledge to generate insights about text attributes, using sample texts and evaluation rubrics. (B) Checklist Construction: Insights are combined and categorized into key components, leading to the generation and validation of checklist questions. (C) Checklist-based Evaluation: An evaluator LLM answers the checklists to evaluate the summaries, and the results are aggregated into a final score, which is then checked against a ground-truth score.
  • Figure 2: Text Attribute Generation Process: Human experts and LLMs are provided with a system message, a target dimension rubric, sample texts for the dimension, and their corresponding scores. They then review the materials and, drawing on their own knowledge, propose text attributes that should be considered when rating the dimension. Technically speaking, humans verbalize text attributes aloud as they think, whereas large language models (LLMs) generate text attributes after processing the input prompt.
  • Figure 3: The Overall Process of Text Evaluation: Stage 1: Four humans and four LLMs separately take part in think-aloud-based text attribute generation. Stage 2: Only GPT-4 is used to construct the checklists. Stage 3: Either GPT-4 or GPT-3.5-Turbo evaluates the quality of the summaries, with the checklists provided to guide the evaluation.
  • Figure 4: Comparison of Distributions across Four Dimensions in SummEval Dataset: Entire Dataset refers to the distribution of the entire dataset with respect to each dimension, while 1st Sampled Dataset and 2nd Sampled Dataset represent the distributions of subsets obtained by sampling 10% of the entire dataset.
  • Figure 5: Performance of InteractEval across Different Component Numbers: Two correlation measures (y-axis) are compared across the four evaluation dimensions (Coherence, Fluency, Consistency, and Relevance) by varying the number of components (x-axis) extracted from Comb-TA attributes. The reported correlation values represent the average results from the first and second trials.
  • ...and 7 more figures
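
The captions above describe checking predicted scores against ground-truth scores with two correlation measures, which are not named here. The sketch below assumes rank correlations (Spearman's rho and Kendall's tau-b) as one plausible pair and computes them in plain Python. The score lists are made up for illustration and the helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b, counting pairs in O(n^2)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both: counted in neither tie total
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    if denom == 0:
        raise ValueError("correlation is undefined for constant input")
    return (concordant - discordant) / denom


# Made-up scores: ground-truth human ratings vs. evaluator predictions.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_scores = [1.5, 2.0, 2.5, 4.5, 4.0]

rho = spearman(human_scores, predicted_scores)
tau = kendall_tau_b(human_scores, predicted_scores)
print(f"Spearman rho = {rho:.2f}, Kendall tau-b = {tau:.2f}")
```

With these toy inputs only the last two predictions swap order, so both measures stay high but below 1. For real experiments a tested library routine is usually preferable to hand-written code like this.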