RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

Jianhao Yan, Yun Luo, Yue Zhang

TL;DR

RefuteBench 2.0 introduces an agent-based dynamic evaluation framework to assess how LLMs incorporate user refutations, featuring LLM refuters and evaluators and supporting both transient and persistent refutation scenarios. The approach extends RefuteBench 1.0 by replacing fixed refutation templates with flexible, context-aware refutations and by adding transient refutation evaluation, with seed data from MT, XSum, and Mage-derived tasks. Meta-evaluation demonstrates that LLM-based refuters produce more human-like refutations and that evaluators correlate well with human judgments, notably GPT-o1-mini, which achieves a Pearson correlation of 0.79. Experimental results across GPT-4o, Claude-3.5-Sonnet, Mixtral, Qwen, Gemma, and LLaMA models reveal that while models can satisfy refutations in isolated turns, they struggle to memorize past refutations and to maintain fidelity to the initial task as dialogue length increases, highlighting attention and memory limitations in long-context interactions.

Abstract

In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter could generate more human-like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly, we also observe that the performance of the initial task decreases as the refutations increase. Analysis of the attention scores further shows a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long context dialogues. https://github.com/ElliottYan/RefuteBench-2.0
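The meta-evaluation compares evaluator scores with human judgments through a correlation coefficient. The following is a minimal sketch of how a Pearson correlation between the two could be computed. The function name `pearson` and the score lists are hypothetical and invented purely for illustration; they are not data or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Invented example: 1 = refutation satisfied, 0 = not satisfied.
human_scores = [1, 0, 1, 1, 0, 1]
evaluator_scores = [1, 0, 1, 0, 0, 1]
r = pearson(human_scores, evaluator_scores)
print(f"Pearson r = {r:.2f}")
```

A value near 1 would indicate that the evaluator ranks responses much as the human annotators do; a value near 0 would indicate little linear agreement.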

Paper Structure

This paper contains 26 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Machine translation instance for a query-feedback-response cycle, where a user provides a refutation instruction to modify the translation results, and the model responds with new results.
  • Figure 2: Examples of Transient Refutation and Persistent Refutation, illustrated with machine translation.
  • Figure 3: The refuter prompt in RefuteBench 2.0. FOCUS is randomly chosen from 'style', 'word usage', and 'phrase usage'. QUERY is the initial query and RESPONSE is the model's previous response. We also inform the refuter of its previous refutations via REFUTATION_i, to avoid duplicate or conflicting refutations.
  • Figure 4: The evaluation prompt in RefuteBench 2.0. QUERY, PREV_ANSWER, REFUTATION and NEW_ANSWER refer to the initial query, the response in the last turn, the refutation instruction and the new response in the current turn, respectively.
  • Figure 5: The correlation between different evaluator performance and human annotations.
  • ...and 4 more figures
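
The Figure 3 caption names the slots of the refuter prompt (FOCUS, QUERY, RESPONSE, REFUTATION_i). A rough sketch of how such a prompt might be assembled is given below. The function `build_refuter_prompt`, the field labels, and the closing instruction are assumptions made for illustration; the actual prompt wording in RefuteBench 2.0 is not reproduced here.

```python
import random

# The three focus options listed in the Figure 3 caption.
FOCUS_OPTIONS = ["style", "word usage", "phrase usage"]


def build_refuter_prompt(query, response, previous_refutations, rng=random):
    """Fill a hypothetical refuter prompt with the slots named in Figure 3."""
    focus = rng.choice(FOCUS_OPTIONS)  # FOCUS is chosen at random
    lines = [
        f"FOCUS: {focus}",
        f"QUERY: {query}",
        f"RESPONSE: {response}",
    ]
    # Earlier refutations are listed so the refuter can avoid repeating
    # or contradicting them.
    for i, refutation in enumerate(previous_refutations, start=1):
        lines.append(f"REFUTATION_{i}: {refutation}")
    # Hypothetical closing instruction, not the paper's wording.
    lines.append(
        "Write one new refutation about the FOCUS that neither duplicates "
        "nor conflicts with the refutations above."
    )
    return "\n".join(lines)


prompt = build_refuter_prompt(
    query="Translate into German: 'The meeting is postponed.'",
    response="Das Treffen ist verschoben.",
    previous_refutations=["Use 'Besprechung' instead of 'Treffen'."],
    rng=random.Random(0),  # seeded so the chosen focus is reproducible
)
print(prompt)
```

Passing a seeded `random.Random` instance keeps the sampled focus reproducible across runs, which can help when comparing models on the same refutation sequence.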