For Generated Text, Is NLI-Neutral Text the Best Text?

Michail Mersinias; Kyle Mahowald

TL;DR

The paper addresses how natural language inference (NLI) judgments can diagnose and improve long-form text generation. It first analyzes GPT-3 outputs with NLI labels from a pre-trained NLI model on the Scarecrow dataset to relate NLI classes to common error types, revealing that neutral text is most prevalent and that the relationship between NLI class and errors is contingent on the nucleus sampling parameter $p$. $P(neutral)$ thresholds and error patterns are then used to design a realtime NLI-guided generation pipeline with GPT-J, evaluating eight configurations (vanilla, ENT, NEU, CON across $p=0.4$ and $p=0.96$) and using a threshold of $P(neutral) > 0.85$ for the NEU strategy. The results show that maximizing neutral text yields the highest overall quality across settings, with entailment helping at higher randomness and contradiction offering gains at lower randomness, though the approach is computationally intensive and demonstrated on GPT-J rather than larger, human-in-the-loop systems. Overall, the work suggests neutral content as a robust target for improving generation quality and provides a framework for integrating NLI into generation, with implications for designing more coherent and less redundant text under varying sampling randomness.

Abstract

We explore incorporating natural language inference (NLI) into the text generative pipeline by using a pre-trained NLI model to assess whether a generated sentence entails, contradicts, or is neutral to the prompt and preceding text. First, we show that the NLI task is predictive of generation errors made by GPT-3. We use these results to develop an NLI-informed generation procedure for GPT-J. Then, we evaluate these generations by obtaining human annotations on error types and overall quality. We find that an NLI strategy of maximizing entailment improves text generation when the nucleus sampling randomness parameter value is high, while one which maximizes contradiction is in fact productive when the parameter value is low. Overall, though, we demonstrate that an NLI strategy of maximizing the neutral class provides the highest quality of generated text (significantly better than the vanilla generations), regardless of parameter value.
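
The NLI-informed procedure described above can be pictured as a sentence-level filter: each candidate sentence is scored against the prompt plus the preceding text, and it is kept only if the probability of the target NLI class is high enough. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function name `nli_guided_generate`, the callables `propose_sentence` and `score_nli`, the `max_attempts` retry budget, and the fallback to the best-scoring candidate are all hypothetical; only the $P(neutral) > 0.85$ threshold comes from the TL;DR. In practice `propose_sentence` would wrap a language model such as GPT-J with nucleus sampling and `score_nli` would wrap a pre-trained NLI classifier.

```python
import itertools
from typing import Callable, Dict, List

# Probabilities for the three NLI classes, keyed by class name.
NliScores = Dict[str, float]


def nli_guided_generate(
    prompt: str,
    propose_sentence: Callable[[str], str],        # hypothetical: samples one sentence
    score_nli: Callable[[str, str], NliScores],    # hypothetical: (premise, hypothesis) -> probs
    target: str = "neutral",
    threshold: float = 0.85,                       # NEU threshold quoted in the TL;DR
    num_sentences: int = 3,
    max_attempts: int = 5,                         # assumption: retry budget per sentence
) -> List[str]:
    """Build a continuation one sentence at a time, filtering by NLI class."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    accepted: List[str] = []
    for _ in range(num_sentences):
        # Premise = prompt plus everything accepted so far.
        context = " ".join([prompt] + accepted)
        best, best_p = "", -1.0
        for _ in range(max_attempts):
            candidate = propose_sentence(context)
            p = score_nli(context, candidate)[target]
            if p > best_p:
                best, best_p = candidate, p
            if p > threshold:
                break  # candidate passes the filter, stop resampling
        # Assumption: if nothing passes, fall back to the best candidate seen.
        accepted.append(best)
    return accepted


if __name__ == "__main__":
    # Toy stand-ins for the generator and the NLI model (made-up scores).
    toy_candidates = itertools.cycle(
        ["It rained.", "The market opened late.", "It did not rain."]
    )
    toy_scores = {
        "It rained.": {"entailment": 0.90, "neutral": 0.05, "contradiction": 0.05},
        "The market opened late.": {"entailment": 0.04, "neutral": 0.92, "contradiction": 0.04},
        "It did not rain.": {"entailment": 0.02, "neutral": 0.03, "contradiction": 0.95},
    }
    result = nli_guided_generate(
        "It rained all day.",
        lambda context: next(toy_candidates),
        lambda premise, hypothesis: toy_scores[hypothesis],
        num_sentences=2,
    )
    print(result)
```

Switching `target` to `"entailment"` or `"contradiction"` gives the analogous ENT and CON variants; the thresholds for those strategies are not stated here, so any value passed for them is an assumption.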

Paper Structure (11 sections, 5 figures, 2 tables)

This paper contains 11 sections, 5 figures, 2 tables.

Introduction
Analysis of GPT-3 Text Through Natural Language Inference
Realtime NLI to Improve Generation
Method
Results
Conclusion
Appendix: Compute
Appendix: Annotation Guidelines
Dataset
Error Types
Annotation Process

Figures (5)

Figure 1: Average holistic ratings for generations from vanilla GPT-J (control), vs. NLI Strategies of maximizing for neutral, contradiction, or entailment, for 2 different choices of parameter values. Neutral performs best in all cases (significantly better than control), but maximizing contradictions is better than the control when randomness is low, and maximizing entailment is better than the control when randomness is high.
Figure 2: Proportion of erroneous examples in Scarecrow per error type for high and low $p$ parameter.
Figure 3: For high and low $p$ (randomness) parameters in Scarecrow, rank correlation between proportion of text showing an error type (y-axis) and the probability of the given NLI class (x-axis).
Figure 4: For our text generation task, the average human annotator ratings for each of 4 Scarecrow error types, broken up by whether we use vanilla GPT-J output (control), maximize neutral NLI relationships in generated text, maximize entailments, or maximize contradictions. Maximizing neutral is best overall, but maximizing entailment is better than maximizing contradiction when randomness is high and vice versa when randomness is low.
Figure 5: Description of redundant, off-prompt, self-contradiction and incoherent error types along with corresponding examples, according to the Scarecrow annotation framework.
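
The rank correlation reported in Figure 3 can be illustrated with a small computation. The sketch below assumes Spearman's $\rho$ (the caption says only "rank correlation", so the exact statistic is an assumption) and computes it as the Pearson correlation of average ranks, which handles ties. The helper names `average_ranks` and `spearman_rho` are hypothetical, and the numbers in the demo are toy values, not data from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy values only: binned P(NLI class) vs. proportion of text with an error.
    toy_nli_probability = [0.1, 0.3, 0.5, 0.7, 0.9]
    toy_error_proportion = [0.40, 0.35, 0.30, 0.32, 0.10]
    print(spearman_rho(toy_nli_probability, toy_error_proportion))
```

A positive value would mean the error type becomes more frequent as the NLI class probability rises, and a negative value the opposite.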
