FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback

Youquan Li, Miao Zheng, Fan Yang, Guosheng Dong, Bin Cui, Weipeng Chen, Zenan Zhou, Wentao Zhang

TL;DR

FB-Bench introduces a fine-grained, multi-task benchmark to evaluate how LLMs respond to human feedback in realistic, two-turn interaction scenarios within Chinese usage. It builds a three-tier taxonomy (query task, model response, user feedback) and uses a GPT-based judge with a weighted checklist to enable fine-grained evaluation, yielding 591 curated samples across eight tasks, five deficiencies, and nine feedback types. Experimental results across 27 LLMs show a narrowing gap between open-source and closed-source models, with stronger models excelling in both error correction and response maintenance, and hints significantly boosting quality while misinformation degrades performance. The work provides actionable insights into how task types and feedback modalities shape responsiveness and offers directions for safer, more reliable human-LLM interactions, though it is currently limited to Chinese data and evaluation via LLM-based judges.

Abstract

Human feedback is crucial in the interactions between humans and Large Language Models (LLMs). However, existing research primarily focuses on benchmarking LLMs in single-turn dialogues. Even in benchmarks designed for multi-turn dialogues, the user inputs are often independent, neglecting the nuanced and complex nature of human feedback within real-world usage scenarios. To fill this research gap, we introduce FB-Bench, a fine-grained, multi-task benchmark designed to evaluate LLMs' responsiveness to human feedback under real-world usage scenarios in Chinese. Drawing from the two main interaction scenarios, FB-Bench comprises 591 meticulously curated samples, encompassing eight task types, five deficiency types of response, and nine feedback types. We extensively evaluate a broad array of popular LLMs, revealing significant variations in their performance across different interaction scenarios. Further analysis indicates that task, human feedback, and deficiencies of previous responses can also significantly impact LLMs' responsiveness. Our findings underscore both the strengths and limitations of current models, providing valuable insights and directions for future research. Code and datasets are available at https://github.com/PKU-Baichuan-MLSystemLab/FB-Bench.

Paper Structure

This paper contains 40 sections, 2 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: LLMs proficient in single-turn interactions might not handle user feedback well (left), while those not great at single-turn can excel in correcting their previous errors by using feedback effectively (right).
  • Figure 2: Overview of FB-Bench. (1) Data Curation: A human-LLM synergy pipeline for mining target data from real-world scenarios and improving their quality and diversity. (2) Three-tier Hierarchical Taxonomy: Comprising 8 popular task types, 5 deficiency types, and 9 feedback types, derived from two interaction scenarios. (3) Auto-Evaluation: An LLM-as-a-Judge framework to automatically evaluate an LLM's response with a weighted checklist.
  • Figure 3: FB-Bench Statistics.
  • Figure 4: Subset evaluation results on FB-Bench for the error correction and response maintenance scenarios. Overall denotes the mean of the error correction score and the response maintenance score. The dashed line represents the diagonal $y=x$.
  • Figure 5: The performance of the top four LLMs across eight popular tasks.
  • ...and 8 more figures