Probing Neural Network Comprehension of Natural Language Arguments

Timothy Niven, Hung-Yu Kao

TL;DR

This paper asks whether BERT truly understands natural language arguments or merely exploits dataset cues in the Argument Reasoning Comprehension Task (ARCT). It analyzes spurious statistical cues with a formal framework, runs probing experiments, and constructs an adversarial dataset that mirrors the cue distribution around both labels so that the cues carry no signal. The findings show that BERT's peak accuracy of 77% is accounted for by cue exploitation: on the adversarial data its performance drops to essentially random levels, suggesting limited true argument comprehension. The work argues for adopting the adversarial dataset as the standard benchmark for robust evaluation of argument comprehension in NLP.

Abstract

We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work.

Paper Structure

This paper contains 8 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: An example of a data point from the ARCT test set and how it should be read. The inference from $R$ and $A$ to $\lnot C$ is by design.
  • Figure 2: General architecture of the models in our experiments. Logits are independently calculated for each argument-warrant pair then concatenated and passed through softmax.
  • Figure 3: Processing an argument-warrant pair with BERT. The reason (with word pieces of length $a$) and claim (length $b$) together form the first utterance, and the warrant (length $c$) is the second. The final CLS vector is then passed to a linear layer to calculate the logit $z^{(i)}_j$.
  • Figure 4: Original and adversarial data points. The claim is negated and the warrants are swapped. The assignment of labels to $W$ and $A$ is kept the same. By including both, the distribution of linguistic artifacts in the warrants is mirrored around the labels, eliminating the major source of spurious statistical cues in ARCT.
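
The scoring scheme described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes each argument-warrant pair has already been encoded as a fixed-size vector; the linear scorer `score_pair`, the toy vectors, and the weights are hypothetical stand-ins for illustration, not the paper's models or parameters.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_pair(pair_vector, weights, bias=0.0):
    """Hypothetical linear layer: maps one encoded argument-warrant pair to a logit."""
    return sum(x * w for x, w in zip(pair_vector, weights)) + bias


def warrant_probabilities(pair_vectors, weights, bias=0.0):
    """Score each pair independently, then normalize the logits jointly."""
    logits = [score_pair(v, weights, bias) for v in pair_vectors]
    return softmax(logits)


# Toy encodings for the two argument-warrant pairs (made-up numbers).
pair_vectors = [[0.2, -1.0, 0.5], [1.1, 0.3, -0.4]]
weights = [0.5, -0.25, 1.0]

probs = warrant_probabilities(pair_vectors, weights)
prediction = probs.index(max(probs))  # index of the warrant judged correct
```

Because the two logits are computed separately and only meet at the softmax, the scorer never compares the two warrants directly.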
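
The transformation described in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the `DataPoint` fields, the function names, and the toy example are assumptions, and the negated claim is supplied by the caller rather than generated.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataPoint:
    claim: str
    reason: str
    warrant0: str
    warrant1: str
    label: int  # index (0 or 1) of the warrant slot marked correct


def make_adversarial(point, negated_claim):
    """Negate the claim and swap the warrant texts; the label slot is unchanged."""
    return replace(
        point,
        claim=negated_claim,
        warrant0=point.warrant1,
        warrant1=point.warrant0,
    )


def augment(points_with_negations):
    """Return each original data point followed by its adversarial copy."""
    augmented = []
    for point, negated_claim in points_with_negations:
        augmented.append(point)
        augmented.append(make_adversarial(point, negated_claim))
    return augmented


# Made-up example for illustration only (not taken from ARCT).
original = DataPoint(
    claim="We should carry umbrellas",
    reason="It is raining",
    warrant0="Umbrellas keep people dry",
    warrant1="Umbrellas do not keep people dry",
    label=0,
)
adversarial = make_adversarial(original, "We should not carry umbrellas")
dataset = augment([(original, "We should not carry umbrellas")])


def correct_warrant(p):
    return p.warrant0 if p.label == 0 else p.warrant1
```

In the combined dataset every warrant text appears once as the correct choice and once as the incorrect one, so a cue that occurs in a warrant no longer predicts the label.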