See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Yulong Chen; Yang Liu; Jianhao Yan; Xuefeng Bai; Ming Zhong; Yinghao Yang; Ziyi Yang; Chenguang Zhu; Yue Zhang

TL;DR

We present a Self-Challenge evaluation framework with a human in the loop, showing that LLMs can autonomously identify their inherent flaws and offering insights for future dynamic and automatic evaluation.

Abstract

The impressive performance of Large Language Models (LLMs) has consistently outpaced numerous human-designed benchmarks, making it harder to assess their shortcomings. Designing tasks and finding LLMs' limitations are becoming increasingly important. In this paper, we investigate whether an LLM can discover its own limitations from the errors it makes. To this end, we propose a Self-Challenge evaluation framework with a human in the loop. Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances, and we iteratively incorporate human feedback on those instances to refine the patterns so that they generate more challenging data. We end up with 8 diverse patterns, such as text manipulation and questions with assumptions. We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses. SC-G4 serves as a challenging benchmark that allows for a detailed assessment of LLMs' abilities. Our results show that GPT-4 answers only 44.96% of the instances in SC-G4 correctly. Interestingly, our pilot study indicates that these error patterns also challenge other LLMs, such as Claude-3 and Llama-3, and cannot be fully resolved through fine-tuning. Our work takes a first step toward demonstrating that LLMs can autonomously identify their inherent flaws, and it provides insights for future dynamic and automatic evaluation.
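
The iterative loop described in the abstract (summarize patterns from seed failures, generate new instances, collect human feedback, refine) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `summarize`, `generate`, `get_feedback`, and `optimize`, the `Feedback` type, and the stopping rule (a fixed round budget with an early exit once the human marks the pattern as challenging) are all hypothetical placeholders for LLM prompts and human annotation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Feedback:
    """Hypothetical container for one round of human feedback."""
    challenging: bool  # did the generated queries actually expose failures?
    comment: str       # free-text note used to refine the pattern


def self_challenge(
    seed_failures: List[str],
    summarize: Callable[[List[str]], str],
    generate: Callable[[str], List[str]],
    get_feedback: Callable[[str, List[str]], Feedback],
    optimize: Callable[[str, Feedback], str],
    max_rounds: int = 3,
) -> str:
    """Refine one error pattern; all callables are assumed stand-ins."""
    # Step 1: summarize an initial error pattern from seed failure instances.
    pattern = summarize(seed_failures)
    for _ in range(max_rounds):
        # Step 2: generate queries from the pattern and collect human feedback.
        queries = generate(pattern)
        feedback = get_feedback(pattern, queries)
        if feedback.challenging:
            break  # assumed stopping rule, not taken from the paper
        # Step 3: rewrite the pattern using the feedback.
        pattern = optimize(pattern, feedback)
    return pattern
```

In practice, `summarize`, `generate`, and `optimize` would wrap prompts to the model under test, and `get_feedback` would be filled in by annotators; the sketch only fixes the control flow between those steps.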

Paper Structure (47 sections, 5 equations, 3 figures, 11 tables)

Introduction
Self-Challenge Framework
Pattern Summarization
Pattern Evaluation
Pattern Optimization
Iterative Self-Challenge
Discovering Error Patterns in GPT-4 and Constructing SC-G4 Benchmark
Seed Instance Collection
Error Pattern Discovery
The SC-G4 Benchmark
Benchmarking LLMs on SC-G4 and Investigating Error Generalization across LLMs
Experimental Setup
Result and Analysis
Zero-shot Performance
Few-shot Performance
...and 32 more sections

Figures (3)

Figure 1: The overall Self-Challenge framework. We first summarize initial error patterns from seed failure instances (Step 1). We then perform pattern evaluation (Step 2), which determines whether the summarized patterns can be used to generate challenging queries and collects the corresponding human feedback, followed by pattern optimization (Step 3), which modifies the original pattern based on that feedback so that it describes the challenging features more accurately (the difference between the initial and optimized patterns is highlighted by underlined text). Steps 2 and 3 are repeated iteratively.
Figure 2: An example of an optimized pattern alongside its original pattern, with the differences highlighted. Full patterns can be found in Appendix \ref{append:pattern}.
Figure 3: Breakdown of zero-shot performance by individual domain. The domain IDs correspond to the IDs in Table \ref{tab:wiki_title} in Appendix \ref{append:domain}.
