CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text

Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, William L. Hamilton

TL;DR

CLUTRR introduces a semi-synthetic benchmark to probe inductive reasoning and systematic generalization in natural language understanding by requiring inference of kinship relations from stories. The dataset uses a graph-based knowledge base to generate reasoning chains, which are paraphrased into natural language, with controlled holdouts and noise to test generalization and robustness. Empirical results show graph-attention networks operating on structured representations outperform state-of-the-art text-based models on systematic generalization tasks and robustly handle noisy conditions, highlighting a gap in current NL understanding approaches. The work provides a diagnostic framework for developing more compositional, modular, and robust NLU systems.

Abstract

The recent success of natural language understanding (NLU) systems has been troubled by results highlighting the failure of these models to generalize in a systematic and robust way. In this work, we introduce a diagnostic benchmark suite, named CLUTRR, to clarify some key issues related to the robustness and systematicity of NLU systems. Motivated by classic work on inductive logic programming, CLUTRR requires that an NLU system infer kinship relations between characters in short stories. Successful performance on this task requires both extracting relationships between entities, as well as inferring the logical rules governing these relationships. CLUTRR allows us to precisely measure a model's ability for systematic generalization by evaluating on held-out combinations of logical rules, and it allows us to evaluate a model's robustness by adding curated noise facts. Our empirical results highlight a substantial performance gap between state-of-the-art NLU models (e.g., BERT and MAC) and a graph neural network model that works directly with symbolic inputs---with the graph-based model exhibiting both stronger generalization and greater robustness.

Paper Structure

This paper contains 25 sections, 1 equation, 8 figures, 11 tables.

Figures (8)

  • Figure 1: CLUTRR inductive reasoning task.
  • Figure 2: Data generation pipeline. Step 1: Generate a kinship graph. Step 2: Sample a target fact. Step 3: Use backward chaining to sample a set of facts. Step 4: Convert the sampled facts to a natural language story.
  • Figure 3: Illustration of how a set of facts can be split and combined in various ways across sentences.
  • Figure 4: Noise generation procedures of CLUTRR.
  • Figure 5: Systematic generalization performance of different models when trained on clauses of length $k=2,3$ (Left) and $k=2,3,4$ (Right).
  • ...and 3 more figures
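
The four steps in the Figure 2 caption can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' generator: the three-rule base `RULES`, the fixed toy family in `make_graph`, the helper names, and the single sentence template in `to_story`, which stands in for the paraphrasing step described in the TL;DR.

```python
import random

# Toy rule base (hypothetical, not CLUTRR's actual rules).
# head <- (r1, r2): "x is the head of y" if "x is the r1 of z" and "z is the r2 of y".
RULES = {
    "grandfather": [("father", "father"), ("father", "mother")],
    "grandmother": [("mother", "father"), ("mother", "mother")],
    "great-grandfather": [("father", "grandfather"), ("father", "grandmother")],
}


def make_graph():
    """Step 1: a fixed toy kinship graph; (x, rel, y) reads 'x is the rel of y'."""
    return {
        ("Carl", "father", "Dana"),
        ("Dana", "mother", "Evan"),
        ("Evan", "father", "Fay"),
    }


def closure(facts):
    """Forward-apply RULES until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for x, r1, z in list(known):
            for z2, r2, y in list(known):
                if z != z2:
                    continue
                for head, bodies in RULES.items():
                    if (r1, r2) in bodies and (x, head, y) not in known:
                        known.add((x, head, y))
                        changed = True
    return known


def sample_target(atomic, known, rng):
    """Step 2: pick a derived (non-atomic) fact as the query to answer."""
    return rng.choice(sorted(known - atomic))


def backward_chain(fact, atomic, known):
    """Step 3: expand a fact through rule bodies until only atomic facts remain."""
    if fact in atomic:
        return [fact]
    x, rel, y = fact
    for r1, r2 in RULES.get(rel, []):
        for a, r, z in sorted(known):
            if a == x and r == r1 and (z, r2, y) in known:
                left = backward_chain((x, r1, z), atomic, known)
                right = backward_chain((z, r2, y), atomic, known)
                return left + right
    raise ValueError(f"no proof found for {fact}")


def to_story(facts):
    """Step 4: render facts with one fixed template (a stand-in for paraphrases)."""
    return " ".join(f"{x} is the {rel} of {y}." for x, rel, y in facts)


if __name__ == "__main__":
    graph = make_graph()
    known = closure(graph)
    target = ("Carl", "great-grandfather", "Fay")
    print(to_story(backward_chain(target, graph, known)))
```

For the fixed target in the demo, the story consists of the three atomic facts linking Carl to Fay, so a model must compose them to recover the held-back relation "great-grandfather". Calling `sample_target(graph, known, random.Random(0))` instead draws the query at random from the derived facts.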