AbsPyramid: Benchmarking the Abstraction Ability of Language Models with a Unified Entailment Graph

Zhaowei Wang, Haochen Shi, Weiqi Wang, Tianqing Fang, Hongming Zhang, Sehyun Choi, Xin Liu, Yangqiu Song

TL;DR

AbsPyramid is a large-scale, open-domain abstraction benchmark that unifies noun-, verb-, and event-level entailment in a single graph of 221K tuples. The graph is derived from ASER, enriched with the WordNet taxonomy and LLM-generated abstractions, and validated through crowdsourcing. The benchmark defines two evaluation tasks, abstraction detection and abstraction generation, and the experiments cover PLMs, NLI models, LoRA-tuned LLMs, and API-based LLMs to expose current limitations in both understanding and generating abstractions. Fine-tuned models do well on nouns, but abstraction of verbs and events remains challenging. Fine-tuning on AbsPyramid also improves transfer to other datasets such as Levy/Holt and AbstractATOMIC, which points to the benchmark's broad utility. Overall, AbsPyramid offers a scalable framework and empirical evidence for evaluating and improving the abstraction ability of language models, with implications for cross-domain generalization and open-domain reasoning.

Abstract

Cognitive research indicates that abstraction ability is essential to human intelligence, yet this ability remains under-explored in language models. In this paper, we present AbsPyramid, a unified entailment graph of 221K textual descriptions of abstraction knowledge. While existing resources only touch nouns or verbs within simplified events or specific domains, AbsPyramid collects abstract knowledge for three components of diverse events to comprehensively evaluate the abstraction ability of language models in the open domain. Experimental results demonstrate that current LLMs struggle to comprehend abstraction knowledge in zero-shot and few-shot settings. By training on our rich abstraction knowledge, we find that LLMs can acquire basic abstraction abilities and generalize to unseen events. We also show empirically that our benchmark is comprehensive enough to enhance LLMs on two previous abstraction tasks.
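
To make the idea of a unified entailment graph more concrete, here is a minimal Python sketch of how abstraction tuples could be stored and turned into inputs for the detection and generation tasks. It is an illustration under stated assumptions, not the authors' data format or code. Only the relation name Noun-Entail appears in the source (in the Figure 2 caption); the labels Verb-Entail and Event-Entail, the class and field names, and the example event are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# "Noun-Entail" is named in the paper's Figure 2 caption; the other two labels
# are assumed here by analogy with the Noun / Verb / Event components.
RELATIONS = ("Noun-Entail", "Verb-Entail", "Event-Entail")


@dataclass(frozen=True)
class AbstractionTuple:
    head: str       # the event text (hypothetical example below)
    relation: str   # one of RELATIONS
    instance: str   # the part of the event being abstracted
    concept: str    # candidate abstract concept
    valid: bool     # hypothetical annotation: is the abstraction correct?


class EntailmentGraph:
    """Stores abstraction tuples indexed by (head event, relation)."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, item):
        if item.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {item.relation}")
        self._edges[(item.head, item.relation)].append(item)

    def abstractions(self, head, relation):
        """Valid abstract concepts for one event and relation (generation targets)."""
        return [t.concept for t in self._edges.get((head, relation), []) if t.valid]

    def detection_examples(self):
        """(input, label) pairs for binary abstraction detection."""
        return [((t.head, t.instance, t.concept), t.valid)
                for items in self._edges.values() for t in items]


event = "PersonX drinks coffee"  # made-up event, not taken from the benchmark
graph = EntailmentGraph()
graph.add(AbstractionTuple(event, "Noun-Entail", "coffee", "beverage", True))
graph.add(AbstractionTuple(event, "Noun-Entail", "coffee", "furniture", False))
graph.add(AbstractionTuple(event, "Verb-Entail", "drinks", "consume", True))
graph.add(AbstractionTuple(event, "Event-Entail", event, "having a drink", True))

print(graph.abstractions(event, "Noun-Entail"))
for example, label in graph.detection_examples():
    print(example, label)
```

In this sketch, detection is framed as deciding whether a candidate concept is a valid abstraction of one component of an event, while generation asks for the valid concepts themselves. Keeping all three relation types in one structure is what lets a single model be trained and evaluated across nouns, verbs, and whole events.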

Paper Structure

This paper contains 45 sections, 6 figures, and 23 tables.

Figures (6)

  • Figure 1: An illustration of our AbsPyramid benchmark. We identify three components of events (i.e., Noun, Verb, and Event as a whole) and collect abstract concepts entailed by them.
  • Figure 2: An illustration of the structure of abstraction knowledge, where the entailment relation is Noun-Entail.
  • Figure 3: Error analysis. We find hallucinations in ChatGPT's zero-shot CoT outputs, where the explanation is correct but the conclusion is wrong.
  • Figure 4: Fine-tuning performance on the Levy/Holt dataset. CF stands for continual fine-tuning.
  • Figure 5: Few-shot performance on AbstractATOMIC. CF stands for continual fine-tuning.
  • ...and 1 more figure