COPEN: Probing Conceptual Knowledge in Pre-trained Language Models

Hao Peng; Xiaozhi Wang; Shengding Hu; Hailong Jin; Lei Hou; Juanzi Li; Zhiyuan Liu; Qun Liu

TL;DR

This work introduces COPEN, a large benchmark for probing conceptual knowledge in pre-trained language models (PLMs), addressing a gap left by prior probes, which focused on factual knowledge. It constructs a 446-concept taxonomy and 24k annotated instances across three tasks: Conceptual Similarity Judgment (CSJ), Conceptual Property Judgment (CPJ), and Conceptualization in Contexts (CiC). These tasks test whether PLMs organize entities by concepts, know conceptual properties, and conceptualize entities in context. Across multiple model types and probing methods, the study finds that PLMs possess only partial conceptual knowledge, struggle with hierarchical transitivity, and frequently exhibit conceptual hallucinations driven by spurious word co-occurrences. The authors propose that targeted concept-aware pre-training and knowledge-enhanced architectures are needed to approach human-like conceptual understanding, and they provide public data and code to foster further research. Overall, COPEN highlights fundamental limitations in current PLMs and offers a concrete path toward more concept-aware language understanding systems.
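To make the notion of hierarchical transitivity concrete, here is a minimal sketch. It is not the authors' evaluation code. It assumes a toy taxonomy with a single concept chain per entity (Horse, Mammal, Animal), a placeholder entity name, and a hypothetical set of concepts that a model accepted for that entity. A prediction set violates transitivity when a concept is accepted but one of its ancestors is rejected.

```python
# Toy taxonomy (assumption for illustration): child concept -> parent concept.
SUBCLASS_OF = {"Horse": "Mammal", "Mammal": "Animal"}
# Placeholder entity (hypothetical), linked to its most specific concept.
INSTANCE_OF = {"example_horse": "Horse"}


def concept_chain(entity):
    """Return the concepts of an entity, from most specific to most general.

    Assumes the taxonomy is a tree, so the walk upward always terminates.
    """
    chain = []
    concept = INSTANCE_OF.get(entity)
    while concept is not None:
        chain.append(concept)
        concept = SUBCLASS_OF.get(concept)
    return chain


def violates_transitivity(entity, accepted):
    """True if some accepted concept has an ancestor that was not accepted.

    `accepted` is a hypothetical set of concepts a model judged to hold
    for the entity.
    """
    chain = concept_chain(entity)
    for i, concept in enumerate(chain):
        if concept in accepted:
            # Every more general concept in the chain should also be accepted.
            if any(ancestor not in accepted for ancestor in chain[i + 1:]):
                return True
    return False
```

For example, accepting Horse and Animal but rejecting Mammal counts as a violation, while accepting only Mammal and Animal does not: the model misses the most specific concept but stays consistent with the hierarchy.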

Abstract

Conceptual knowledge is fundamental to human cognition and knowledge bases. However, existing knowledge probing work focuses only on evaluating the factual knowledge of pre-trained language models (PLMs) and ignores conceptual knowledge. Since conceptual knowledge often appears as implicit commonsense behind texts, designing probes for conceptual knowledge is hard. Inspired by knowledge representation schemata, we comprehensively evaluate the conceptual knowledge of PLMs by designing three tasks to probe whether PLMs organize entities by conceptual similarities, learn conceptual properties, and conceptualize entities in contexts, respectively. For the tasks, we collect and annotate 24k data instances covering 393 concepts, forming COPEN, a COnceptual knowledge Probing bENchmark. Extensive experiments on different sizes and types of PLMs show that existing PLMs systematically lack conceptual knowledge and suffer from various spurious correlations. We believe this is a critical bottleneck for realizing human-like cognition in PLMs. COPEN and our code are publicly released at https://github.com/THU-KEG/COPEN.

Paper Structure (58 sections, 8 figures, 16 tables)

This paper contains 58 sections, 8 figures, 16 tables.

Introduction
COPEN Benchmark
COPEN Concept Taxonomy
Conceptual Similarity Judgment
Data Collection
Conceptual Property Judgment
Data Collection
Conceptualization in Contexts
Data Collection
Evaluation Setup
Investigated PLMs
Probing Method
Experiment and Analysis
Overall Results
Conceptual Similarity Judgment
...and 43 more sections

Figures (8)

Figure 1: An example knowledge graph. Entities are organized by concepts through the Instance of relation and concepts are organized into a taxonomy through the Subclass of relation. Each concept has certain properties. Existing work only probes factual knowledge in entity graphs, ignoring conceptual knowledge in the concept taxonomy and Instance of relation.
Figure 2: Examples of casting the data of the three probing tasks into natural language prompts for zero-shot probing. The names of entities and concepts are the text looked up in Wikidata using their IDs. In (b), texts in bold (true or false) denote answers. In (b) and (c), the concept chain is Horse → Mammal → Animal. In (c), for entities with multiple concept chains, each concept is scored independently by the PLMs, i.e., the PLMs make concept-level predictions only. There is no dedicated chain selection procedure.
Figure 3: The false positive rate of BERT's fine-tuning results on CPJ negative instances with different BM25 scores. Results of other PLMs are left in the appendix.
Figure 4: Accuracies (%) of various PLMs at different scales. The accuracies on CPJ are instance-level.
Figure 5: The false positive rate of various PLMs' fine-tuning results on negative instances of the CPJ dataset with different BM25 scores.
...and 3 more figures
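
The false positive rates by BM25 score described in the captions of Figures 3 and 5 can be illustrated with a small sketch. This is not the authors' code: the function name, the input format, and the bin edges are assumptions, and the BM25 scores are assumed to be precomputed elsewhere. Since every input instance is a gold negative, the false positive rate in a bin is the fraction of instances in that bin that the model labels as positive.

```python
from bisect import bisect_right


def false_positive_rate_by_bin(negatives, bin_edges):
    """Compute the false positive rate per score bin.

    negatives: iterable of (score, predicted_positive) pairs, one per
        gold-negative instance (hypothetical input format).
    bin_edges: ascending inner edges; k edges define k + 1 bins, where a
        score equal to an edge falls into the bin above it.
    Returns a list with one rate per bin, or None for an empty bin.
    """
    n_bins = len(bin_edges) + 1
    false_positives = [0] * n_bins
    totals = [0] * n_bins
    for score, predicted_positive in negatives:
        bin_index = bisect_right(bin_edges, score)
        totals[bin_index] += 1
        if predicted_positive:
            false_positives[bin_index] += 1
    return [fp / total if total else None
            for fp, total in zip(false_positives, totals)]


# Made-up scores and predictions, for illustration only.
toy_negatives = [(0.5, False), (1.5, True), (1.7, False), (3.2, True), (3.9, True)]
rates = false_positive_rate_by_bin(toy_negatives, bin_edges=[1.0, 2.0, 3.0])
```

A rate that rises with the score bin would be consistent with the reading that negatives sharing more words with the prompt are more often mistaken for positives.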
