MGen: Millions of Naturally Occurring Generics in Context

Gustavo Cilleruelo, Emily Allaway, Barry Haddow, Alexandra Birch

TL;DR

MGen introduces the largest corpus of naturally occurring generic and quantified sentences in context, mined from Zyda via a scalable two-step pipeline that combines syntactic filtering for bare plurals with a RoBERTa-based semantic classifier. The dataset includes over 3 million bare plural generics and 1 million quantified instances across 11 quantifiers, each paired with its full source document, which averages thousands of words. Empirical analyses show that generics in MGen are longer than typical linguistic examples, frequently generalize about people, and are more diverse than those in prior datasets across multiple metrics, supporting its utility for semantic, cognitive, and NLP research. The work provides open-access resources and a framework for extracting context-rich generics, enabling large-scale investigations into genericity and quantification with practical implications for language understanding and modeling.
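The two-step shape of the pipeline described above (a cheap syntactic filter followed by a semantic classifier) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the regex-based `starts_with_bare_plural` heuristic, the `DETERMINERS` list, the `toy_score` function, and the 0.5 threshold are all hypothetical stand-ins. In particular, `toy_score` takes the place of the RoBERTa-based classifier, whose details are not given here.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical, deliberately small list of determiners and quantifiers.
# A sentence whose subject starts with one of these is not a bare plural.
DETERMINERS = {
    "the", "a", "an", "this", "that", "these", "those",
    "my", "your", "his", "her", "its", "our", "their",
    "all", "some", "most", "many", "few", "no", "every", "each",
}


def starts_with_bare_plural(sentence: str) -> bool:
    """Step 1 (syntactic filter), approximated with a crude heuristic:
    the first word is not a determiner and looks like a plural noun.
    A real system would use a parser or POS tagger instead."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    if len(tokens) < 2:
        return False
    first = tokens[0].lower()
    if first in DETERMINERS:
        return False
    # Ends in "s" but not "ss" (so "Glass" is rejected, "Birds" accepted).
    return first.endswith("s") and not first.endswith("ss")


def toy_score(sentence: str) -> float:
    """Placeholder for the semantic classifier (assumption): treats a
    sentence with an episodic time adverb as non-generic."""
    return 0.0 if "yesterday" in sentence.lower() else 1.0


def mine_generics(
    sentences: Iterable[str],
    score_generic: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Apply the syntactic filter first, then keep only candidates
    whose classifier score reaches the threshold."""
    candidates = [s for s in sentences if starts_with_bare_plural(s)]
    return [s for s in candidates if score_generic(s) >= threshold]


if __name__ == "__main__":
    docs = [
        "Birds fly south in winter.",
        "The birds flew away.",
        "Dogs barked at me yesterday.",
        "Glass is fragile.",
    ]
    print(mine_generics(docs, toy_score))
```

Running the cheap filter first means the more expensive classifier only sees sentences that already have the right surface form, which is what makes this kind of design practical at corpus scale.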

Abstract

MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset come with long context documents, corresponding to websites and academic papers, and cover 11 different quantifiers. We analyze the features of the generic sentences in the dataset and find that generics can be long (averaging over 16 words) and that speakers often use them to express generalisations about people. MGen is the largest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen

Paper Structure

This paper contains 41 sections, 2 equations, 2 figures, and 14 tables.

Figures (2)

  • Figure 1: Sentence length distribution (percentage of samples for every number of words) in the generics and documents of MGen and natural sentences in GenericsKB-Best.
  • Figure D.1: Correspondence of human annotations with RoBERTa classifier scores.
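
The quantity plotted in Figure 1 (the percentage of samples at each sentence length in words) can be computed along these lines. This is a minimal sketch that assumes whitespace tokenization; the tokenization actually used for the figure is not stated here, and the function name `length_distribution` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def length_distribution(sentences: Iterable[str]) -> Dict[int, float]:
    """Map each sentence length (in whitespace-separated words) to the
    percentage of sentences that have that length."""
    lengths = [len(s.split()) for s in sentences]
    if not lengths:
        return {}
    counts = Counter(lengths)
    total = len(lengths)
    # Sort by length so the result reads like the x-axis of a histogram.
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


if __name__ == "__main__":
    sample = ["Birds fly", "Cats sleep", "Dogs chase cars", "Rain"]
    for n_words, pct in length_distribution(sample).items():
        print(n_words, pct)
```

Expressing counts as percentages rather than raw frequencies is what allows corpora of very different sizes to be compared on the same axes.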