
LLMs' Classification Performance is Overclaimed

Hanzi Xu, Renze Lou, Jiangshu Du, Vahid Mahzoon, Elmira Talebianaraki, Zhuoan Zhou, Elizabeth Garrison, Slobodan Vucetic, Wenpeng Yin

TL;DR

This work challenges the assumption that modern LLMs inherently understand classification tasks, showing that standard benchmarks can overstate their capabilities because gold labels are always present. By introducing the Know-No benchmark and the OmniAccuracy metric, the authors assess LLM performance both with and without gold labels, revealing substantial gaps between humans and models, especially under Classify-w/o-Gold. The study shows that even top-tier models behave inconsistently, particularly under No-Hint prompts, and often fail to acknowledge the absence of a correct option, highlighting the need for evaluation frameworks that capture true task understanding. Together, Know-No and OmniAccuracy offer a practical, extensible approach to measuring robust, human-like classification judgment in LLMs, with implications for data-leakage considerations and prompt-design strategies in both open- and closed-source settings.

Abstract

In many classification tasks designed for AI or humans to solve, gold labels are typically included within the label space by default, often posed as "which of the following is correct?" This standard setup has traditionally highlighted the strong performance of advanced AI, particularly top-performing Large Language Models (LLMs), in routine classification tasks. However, when the gold label is intentionally excluded from the label space, it becomes evident that LLMs still attempt to select from the available label candidates, even when none are correct. This raises a pivotal question: Do LLMs truly demonstrate their intelligence in understanding the essence of classification tasks? In this study, we evaluate both closed-source and open-source LLMs across representative classification tasks, arguing that the perceived performance of LLMs is overstated due to their inability to exhibit the expected comprehension of the task. This paper makes a threefold contribution: i) To our knowledge, this is the first work to identify the limitations of LLMs in classification tasks when gold labels are absent. We define this task as Classify-w/o-Gold and propose it as a new testbed for LLMs. ii) We introduce a benchmark, Know-No, comprising two existing classification tasks and one new task, to evaluate Classify-w/o-Gold. iii) This work defines and advocates for a new evaluation metric, OmniAccuracy, which assesses LLMs' performance in classification tasks both when gold labels are present and absent.
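
The abstract describes OmniAccuracy only at a high level. The sketch below illustrates one way such a metric could be computed: score a model in the Classify-w/-Gold setting (pick the gold label) and in the Classify-w/o-Gold setting (recognize that no listed option is correct), then combine the two. The function names, the abstention marker, and the plain averaging scheme are illustrative assumptions, not the paper's exact definition.

```python
# Hypothetical sketch of an OmniAccuracy-style score. The function names,
# the abstention marker, and the averaging scheme are assumptions for
# illustration; the paper's exact definition may differ.

def accuracy_with_gold(predictions, gold_labels):
    """Classify-w/-Gold: the gold label is in the label space; a prediction
    is correct if it matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

def accuracy_without_gold(predictions, abstain_marker="none of the above"):
    """Classify-w/o-Gold: the gold label has been removed from the label
    space, so the only correct behavior is to reject every listed candidate."""
    correct = sum(p.strip().lower() == abstain_marker for p in predictions)
    return correct / len(predictions)

def omni_accuracy(preds_with_gold, gold_labels, preds_without_gold):
    """Combine both settings; a plain average is used here as a placeholder."""
    acc_with = accuracy_with_gold(preds_with_gold, gold_labels)
    acc_without = accuracy_without_gold(preds_without_gold)
    return (acc_with + acc_without) / 2

# Toy usage: 2/3 correct in each setting -> OmniAccuracy of about 0.67.
print(omni_accuracy(["A", "B", "C"], ["A", "B", "D"],
                    ["none of the above", "C", "none of the above"]))
```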

Paper Structure

This paper contains 38 sections, 4 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Latest LLMs (GPT-4o, claude-3-opus, and gemini-1.5-pro as of June 29, 2024) vs. Human when the gold label is present or absent.
  • Figure 2: An example of EquInfer, where the equation labeled with "B" is correct.
  • Figure 3: Humans vs. LLMs on MC-Test.
  • Figure 4: LLMs' output pattern distribution in No-Hint on MC-Test.
  • Figure 5: Scaling the length of context around the equation in EquInfer.