AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses

Xiaotian Lu, Jiyi Li, Koh Takeuchi, Hisashi Kashima

TL;DR

This study proposes a method that combines LLMs with the analytic hierarchy process (AHP) to assess answers to open-ended questions. Results indicate that the approach aligns more closely with human judgment than four baselines.

Abstract

Question answering (QA) tasks have been extensively studied in the field of natural language processing (NLP). Unlike close-ended questions with definitive answers, open-ended questions elicit answers that are highly diverse, difficult to quantify, and cannot simply be judged correct or incorrect. While large language models (LLMs) have demonstrated strong capabilities across various tasks, they perform relatively poorly when evaluating answers to open-ended questions. In this study, we propose a method that leverages LLMs and the analytic hierarchy process (AHP) to assess answers to open-ended questions. We use LLMs to generate multiple evaluation criteria for a question. The answers are then compared pairwise under each criterion by LLMs, and a score for each answer is calculated with AHP. We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4. Our results indicate that our approach aligns more closely with human judgment than four baselines. Additionally, we explored the impact of the number of criteria, variations in models, and differences in datasets on the results.
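The pipeline described in the abstract (criteria, pairwise comparisons under each criterion, AHP aggregation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `toy_judge` is a hypothetical stand-in for an LLM pairwise comparison, the criterion names are invented, the row geometric mean is one common way to approximate AHP priorities, and equal criterion weights are an assumption made only to keep the sketch short.

```python
import math
from itertools import combinations


def toy_judge(criterion, a, b):
    """Hypothetical stand-in for an LLM pairwise comparison.

    Returns a ratio > 1 when answer `a` is preferred over `b`.
    Toy rule for illustration only: longer answers win on "detail",
    shorter answers win on any other criterion.
    """
    len_a, len_b = len(a.split()), len(b.split())
    return len_a / len_b if criterion == "detail" else len_b / len_a


def comparison_matrix(answers, criterion, judge):
    """Build a reciprocal pairwise comparison matrix for one criterion."""
    n = len(answers)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        ratio = judge(criterion, answers[i], answers[j])
        matrix[i][j] = ratio
        matrix[j][i] = 1.0 / ratio  # reciprocity: a_ji = 1 / a_ij
    return matrix


def priorities(matrix):
    """Approximate AHP priorities with the row geometric mean method."""
    n = len(matrix)
    means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(means)
    return [m / total for m in means]


def score_answers(answers, criteria, judge, weights=None):
    """Combine per-criterion priorities into one score per answer."""
    if weights is None:
        # Assumption: equal criterion weights, used here for simplicity.
        weights = [1.0 / len(criteria)] * len(criteria)
    per_criterion = [
        priorities(comparison_matrix(answers, c, judge)) for c in criteria
    ]
    return [
        sum(w * p[k] for w, p in zip(weights, per_criterion))
        for k in range(len(answers))
    ]


answers = [
    "Yes.",
    "Yes, because it saves time and money.",
    "It depends on context.",
]
scores = score_answers(answers, ["detail", "conciseness"], toy_judge)
for answer, score in zip(answers, scores):
    print(f"{score:.3f}  {answer}")
```

In a real setting the judge would be replaced by an LLM call that returns a preference for one answer over the other under the given criterion; how that preference is mapped to a numeric ratio is a design choice this sketch does not settle.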

Paper Structure

This paper contains 12 sections, 8 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: An example of using AHP to choose a restaurant. Three criteria: food, service, and price are used to decide which restaurant to choose. Each criterion has a weight representing its importance, and each restaurant has a score under each criterion, all obtained through pairwise comparison. The final overall scores can be obtained through matrix multiplication.
  • Figure 2: Our proposed AHP-Powered LLM evaluation.
  • Figure 3: Histograms of Scoring evaluation. In each subtitle, the text to the left of the '/' names the dataset; on the right, "3.5" refers to ChatGPT-3.5-turbo and "4" refers to GPT-4. The same notation is used in the other figures. The horizontal and vertical axes represent scores and number of responses, respectively. LLMs tend to assign mid to high scores to answers, which reduces differentiation and worsens the results. GPT-4 performs slightly better than ChatGPT-3.5 but still falls short of being satisfactory.
  • Figure 4: Histograms of Few-shot evaluation. The horizontal and vertical axes represent the level and number of responses, respectively. LLMs show almost no ability to learn from a small number of samples for complex open-ended questions. In most cases, they assign mid to high levels and rarely assign the highest or lowest levels.
  • Figure 5: Histograms of CEFR level evaluation. The horizontal and vertical axes represent the level and number of responses, respectively. LLMs appear unable to learn the level definitions and tend to assign most articles a level of 2 or 3.
  • ...and 1 more figure
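
The restaurant example in Figure 1 can be worked through numerically. The sketch below uses made-up pairwise judgments (all numbers are hypothetical, not taken from the paper) and the row geometric mean as an approximation of the AHP priority vector. Criterion weights and per-criterion restaurant scores both come from pairwise comparison matrices, and the overall scores are the product of the score matrix and the weight vector.

```python
import math


def priority_vector(matrix):
    """Approximate AHP priorities with the row geometric mean method."""
    n = len(matrix)
    means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(means)
    return [m / total for m in means]


# Hypothetical pairwise judgments among criteria: food, service, price.
# Entry [i][j] says how much more important criterion i is than criterion j.
criteria_matrix = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]

# Hypothetical pairwise judgments between restaurants A and B per criterion.
restaurant_matrices = {
    "food": [[1.0, 3.0], [1.0 / 3.0, 1.0]],     # A has better food
    "service": [[1.0, 1.0], [1.0, 1.0]],        # equal service
    "price": [[1.0, 1.0 / 3.0], [3.0, 1.0]],    # B has better prices
}

weights = priority_vector(criteria_matrix)
per_criterion = [
    priority_vector(restaurant_matrices[name])
    for name in ("food", "service", "price")
]

# Matrix multiplication: overall[k] = sum over criteria of weight * score.
overall = [
    sum(w * scores[k] for w, scores in zip(weights, per_criterion))
    for k in range(2)
]

print("criterion weights:", [round(w, 3) for w in weights])
print("overall scores (A, B):", [round(s, 3) for s in overall])
```

With these judgments, food carries the largest weight, so restaurant A ends up ahead even though B wins on price. Both the weights and the overall scores sum to 1 because each priority vector is normalized.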