CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, Xiang Ren

TL;DR

The paper introduces CommonGen, a constrained text-generation benchmark for generative commonsense reasoning, in which a model must produce a coherent sentence that uses every concept in a given set. It presents a large dataset (35k concept-sets, ~79k sentences) built from existing caption corpora and crowdsourced references with rigorous quality control. Experiments show a substantial gap between state-of-the-art pretrained models and human performance on SPICE-based evaluation, underscoring the task's difficulty. The authors also show that context generated by models trained on CommonGen can improve downstream tasks such as CommonsenseQA. Overall, the work advocates constrained generation with commonsense reasoning as a productive direction for advancing NLG systems.

Abstract

Recently, large-scale pre-trained language models have demonstrated impressive performance on several commonsense-reasoning benchmark datasets. However, building machines with commonsense to compose realistically plausible sentences remains challenging. In this paper, we present a constrained text generation task, CommonGen, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts (e.g., {dog, frisbee, catch, throw}), the task is to generate a coherent sentence describing an everyday scenario using these concepts (e.g., "a man throws a frisbee and his dog catches it"). The CommonGen task is challenging because it inherently requires 1) relational reasoning with background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. Our dataset, constructed through a combination of crowdsourced and existing caption corpora, consists of 79k commonsense descriptions over 35k unique concept-sets. Experiments show that there is a large gap between state-of-the-art text generation models (e.g., T5) and human performance. Furthermore, we demonstrate that the learned generative commonsense reasoning capability can be transferred to improve downstream tasks such as CommonsenseQA by generating additional context.
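
The constraint described in the abstract (every given concept should appear in the generated sentence) can be illustrated with a small check. The following is a minimal sketch, not the paper's evaluation code: the function name `concept_coverage` is hypothetical, and prefix matching is assumed here as a crude stand-in for proper lemmatization, so it would mishandle irregular forms such as "caught".

```python
import re


def concept_coverage(concepts, sentence):
    """Return the fraction of concepts matched by at least one token.

    Assumption: a token "matches" a concept if it starts with the concept
    string (e.g. "throws" matches "throw"). This only approximates
    lemma-level matching and is used purely for illustration.
    """
    concepts = {c.lower() for c in concepts}
    if not concepts:
        return 1.0  # nothing to cover
    tokens = re.findall(r"[a-z]+", sentence.lower())
    covered = {c for c in concepts if any(t.startswith(c) for t in tokens)}
    return len(covered) / len(concepts)


# Example from the abstract: all four concepts are used.
concepts = {"dog", "frisbee", "catch", "throw"}
sentence = "a man throws a frisbee and his dog catches it"
print(concept_coverage(concepts, sentence))  # 1.0
```

A check like this covers only the lexical constraint; whether the sentence describes a plausible everyday scenario is the commonsense part of the task and is not captured here.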

Paper Structure

This paper contains 18 sections, 1 equation, 12 figures, 8 tables.

Figures (12)

  • Figure 1: An example of the dataset of CommonGen. GPT-2, UniLM, BART and T5 are large pre-trained text generation models, fine-tuned on the proposed task.
  • Figure 2: Two key challenges of CommonGen: relational reasoning with underlying commonsense knowledge about given concepts (left), and compositional generalization for unseen combinations of concepts (right).
  • Figure 2: The distributions of the relation categories on one/two-hop connections.
  • Figure 3: Dataset construction workflow overview.
  • Figure 4: Curves of inter-annotator agreement (IAA), in terms of standard deviation (top) and median (bottom), as the average number of references increases.
  • ...and 7 more figures