A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task

Yuya Fujisaki, Shiro Takagi, Hideki Asoh, Wataru Kumagai

TL;DR

This paper introduces PaperRQ-HumanAnno-Dataset, a dataset pairing ACL machine-learning papers with RQs extracted by GPT-4 and human judgments across three evaluation perspectives. By analyzing correlations between various LLM-based evaluation functions and human scores, the authors demonstrate that existing functions do not reliably align with human judgment for RQ quality, underscoring the need for RQ-specific evaluation methods. The work reveals that modeling the evaluation procedure yields the most promising improvements, while simple prompt-based approaches and increased token counts offer limited gains. The dataset provides a foundation for developing better domain-specific evaluators, with implications for improving automatic RQ extraction and, more broadly, AI-assisted scholarly analysis.

Abstract

Text summarization techniques have made remarkable progress. However, the task of accurately extracting and summarizing the necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We focus on the task of extracting research questions (RQ) from research papers and construct a new dataset consisting of machine learning papers, RQ extracted from these papers by GPT-4, and human evaluations of the extracted RQ from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarization and found that none of them correlated sufficiently highly with human evaluations. We expect our dataset to provide a foundation for further research on developing better evaluation functions tailored to the RQ extraction task and to contribute to improving performance on the task. The dataset is available at https://github.com/auto-res/PaperRQ-HumanAnno-Dataset.
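
The comparison described in the abstract rests on rank correlation between an evaluation function's scores and human scores; the figure captions name the Spearman and Kendall coefficients. The following is a minimal sketch of how such coefficients can be computed, not the authors' implementation. The function names and the example scores are hypothetical, and the Kendall variant shown is tau-b, chosen here as an assumption because rating scales produce ties.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: pair concordance with a correction for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to no term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + tied_x_only) * (untied + tied_y_only)
    )


# Hypothetical scores for five extracted RQs (illustrative values only).
human_scores = [1, 2, 3, 4, 5]
llm_scores = [2, 1, 4, 3, 5]

print("Spearman:", spearman(human_scores, llm_scores))
print("Kendall tau-b:", kendall_tau_b(human_scores, llm_scores))
```

Both coefficients depend only on the ordering of the scores, so an evaluation function can correlate well with human judgment even if its scores sit on a different scale from the human ratings.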

Paper Structure (69 sections, 10 figures, 12 tables)

Figures (10)

  • Figure 1: This study has two main processes. First, we constructed a dataset consisting of papers, RQ extracted by an LLM, and human evaluation scores of the RQ quality based on the paper abstract and introduction. Second, using this dataset, we analyzed the correlation between the output scores of various LLM-based evaluation functions and human scores, and identified the evaluation function that is closest to human judgment. Through this series of processes, we confirmed the effectiveness of automatic evaluation of RQ using LLM.
  • Figure 2: Visualization of the overlap rate of RQ with mismatched evaluation values between methods, categorized by Method Score, as a correlation diagram.
  • Figure 3: Visualization of Spearman correlation coefficients using violin plots, comparing wang-etal-2023-chatgpt and chiang-lee-2023-closer to confirm the variability due to differences in sample count when the temperature is set to 1 for both methods. Visualization of Kendall correlation coefficients is shown in the appendix section "Impact of sample count on result variability".
  • Figure 4: Visualization of Spearman correlation coefficients using violin plots, comparing the analyze-rate of chiang-lee-2023-closer to confirm the variability due to differences between gpt-4-turbo-2024-04-09 and gpt-4o-2024-05-13 when the temperature is set to 1 for both. Visualization of Kendall correlation coefficients is shown in the appendix section "Variability of results due to model differences".
  • Figure 5: Graph visualizing the average scores of all annotators for each prompt used to extract RQ, categorized by Problem Score, Method Score, and Is Target RQ Type.
  • ...and 5 more figures