GenericsKB: A Knowledge Base of Generic Statements

Sumithra Bhakthavatsalam, Chloe Anastasiades, Peter Clark

TL;DR

GenericsKB tackles the need for large-scale, naturally occurring generic statements by assembling 3.4M generics from cleaned web and ARC corpora and refining them with a 27-rule filter and a BERT-based usefulness classifier. The authors release GenericsKB and GenericsKB-Best (1.0M) with rich metadata, including topic and contextual sentences, to support NLP tasks and linguistic studies. Evaluations on OpenBookQA and QASC show that GenericsKB-Best can improve multi-hop reasoning and explanation quality compared to using a larger, less filtered corpus. The resource offers practical utility for QA, explanation generation, and semantic study of generics, and is publicly accessible for the community.

Abstract

We present a new resource for the NLP community, namely a large (3.5M+ sentence) knowledge base of *generic statements*, e.g., "Trees remove carbon dioxide from the atmosphere", collected from multiple corpora. This is the first large resource to contain *naturally occurring* generic sentences, as opposed to extracted or crowdsourced triples, and thus is rich in high-quality, general, semantically complete statements. All GenericsKB sentences are annotated with their topical term, surrounding context (sentences), and a (learned) confidence. We also release GenericsKB-Best (1M+ sentences), containing the best-quality generics in GenericsKB augmented with selected, synthesized generics from WordNet and ConceptNet. In tests on two existing datasets requiring multihop reasoning (OBQA and QASC), we find using GenericsKB can result in higher scores and better explanations than using a much larger corpus. This demonstrates that GenericsKB can be a useful resource for NLP applications, as well as providing data for linguistic studies of generics and their semantics. GenericsKB is available at https://allenai.org/data/genericskb.
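As a rough illustration of the per-sentence metadata described above (topical term, surrounding context, and a learned confidence), the following minimal sketch models one entry and selects high-confidence generics. The class name, field names, the second toy sentence, the confidence values, and the 0.5 threshold are all assumptions made for illustration; they are not the released file schema or the authors' actual cutoff.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenericEntry:
    """Hypothetical record layout; field names are illustrative only."""
    sentence: str                 # the generic statement itself
    topic: str                    # topical term the sentence is about
    context_before: List[str] = field(default_factory=list)  # preceding sentences
    context_after: List[str] = field(default_factory=list)   # following sentences
    confidence: float = 0.0       # learned quality score (assumed to lie in [0, 1])


def select_best(entries: List[GenericEntry], threshold: float = 0.5) -> List[GenericEntry]:
    """Keep entries scoring at or above the threshold, highest confidence first."""
    kept = [e for e in entries if e.confidence >= threshold]
    return sorted(kept, key=lambda e: e.confidence, reverse=True)


# Toy data: the first sentence is the abstract's example; the scores are made up.
entries = [
    GenericEntry("Trees remove carbon dioxide from the atmosphere.", "tree", confidence=0.9),
    GenericEntry("Trees are here.", "tree", confidence=0.1),
]
best = select_best(entries, threshold=0.5)
```

A confidence cutoff of this kind is one plausible way to derive a smaller, higher-quality subset from the full collection; the abstract does not specify how GenericsKB-Best was actually thresholded.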

Paper Structure

This paper contains 12 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Example generic statements in GenericsKB, plus one showing associated metadata.
  • Figure 2: Example filtering rules (see supplementary material for the full list).