For Generated Text, Is NLI-Neutral Text the Best Text?
Michail Mersinias, Kyle Mahowald
TL;DR
The paper examines how natural language inference (NLI) judgments can diagnose and improve long-form text generation. It first labels GPT-3 outputs from the Scarecrow dataset with a pre-trained NLI model and relates the NLI classes to common error types. This analysis shows that neutral text is the most prevalent class and that the relationship between NLI class and error type depends on the nucleus sampling parameter $p$. These error patterns are then used to design a real-time NLI-guided generation pipeline with GPT-J. The pipeline is evaluated in eight configurations (vanilla, ENT, NEU, and CON, each at $p=0.4$ and $p=0.96$), with a threshold of $P(\text{neutral}) > 0.85$ for the NEU strategy. The results show that maximizing neutral text yields the highest overall quality across settings, while maximizing entailment helps at higher randomness and maximizing contradiction offers gains at lower randomness. The approach is computationally intensive, however, and is demonstrated on GPT-J rather than larger, human-in-the-loop systems. Overall, the work suggests neutral content as a robust target for improving generation quality and provides a framework for integrating NLI into generation, with implications for producing more coherent and less redundant text under varying sampling randomness.
Abstract
We explore incorporating natural language inference (NLI) into the text generative pipeline by using a pre-trained NLI model to assess whether a generated sentence entails, contradicts, or is neutral to the prompt and preceding text. First, we show that the NLI task is predictive of generation errors made by GPT-3. We use these results to develop an NLI-informed generation procedure for GPT-J. Then, we evaluate these generations by obtaining human annotations on error types and overall quality. We find that an NLI strategy of maximizing entailment improves text generation when the nucleus sampling randomness parameter value is high, while one which maximizes contradiction is in fact productive when the parameter value is low. Overall, though, we demonstrate that an NLI strategy of maximizing the neutral class provides the highest quality of generated text (significantly better than the vanilla generations), regardless of parameter value.
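The NLI-informed generation procedure can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a sentence-by-sentence rejection loop: a candidate sentence is sampled, scored by an NLI model against the prompt plus the text accepted so far, and kept once the probability of the target class exceeds a threshold. The function and parameter names (`nli_guided_generate`, `generate_sentence`, `nli_probs`, `max_tries`) are hypothetical, as is the fallback of keeping the best-scoring candidate when no candidate passes. The 0.85 default is the value reported for the NEU strategy; using the same threshold for the other strategies is an assumption. The toy generator and NLI scorer stand in for GPT-J and a pre-trained NLI model.

```python
import itertools
from typing import Callable, Dict, List

NLI_LABELS = ("entailment", "neutral", "contradiction")


def nli_guided_generate(
    prompt: str,
    generate_sentence: Callable[[str], str],
    nli_probs: Callable[[str, str], Dict[str, float]],
    target: str = "neutral",
    threshold: float = 0.85,  # value reported for NEU; assumed for other targets
    num_sentences: int = 3,
    max_tries: int = 10,
) -> List[str]:
    """Generate sentences one at a time, resampling until the NLI
    probability of `target` exceeds `threshold` (hypothetical sketch)."""
    if target not in NLI_LABELS:
        raise ValueError(f"unknown NLI label: {target}")

    accepted: List[str] = []
    for _ in range(num_sentences):
        # Premise: the prompt plus everything accepted so far.
        context = " ".join([prompt] + accepted)
        best, best_p = None, -1.0
        for _ in range(max_tries):
            candidate = generate_sentence(context)
            p = nli_probs(context, candidate)[target]
            if p > best_p:
                best, best_p = candidate, p
            if p > threshold:
                break
        # Assumed fallback: keep the best candidate seen if none passed.
        accepted.append(best)
    return accepted


# Toy stand-ins for the language model and the NLI model (illustration only).
TOY_SCORES = {
    "It is raining.": {"entailment": 0.90, "neutral": 0.05, "contradiction": 0.05},
    "It is sunny.": {"entailment": 0.05, "neutral": 0.05, "contradiction": 0.90},
    "The bus was late.": {"entailment": 0.05, "neutral": 0.90, "contradiction": 0.05},
}


def make_toy_generator(candidates: List[str]) -> Callable[[str], str]:
    """Return a deterministic 'generator' that cycles through fixed candidates."""
    cycle = itertools.cycle(candidates)
    return lambda context: next(cycle)


def toy_nli(premise: str, hypothesis: str) -> Dict[str, float]:
    """Look up fixed class probabilities; ignores the premise."""
    return TOY_SCORES[hypothesis]
```

With the toy stand-ins, `nli_guided_generate("It rained all day.", make_toy_generator(list(TOY_SCORES)), toy_nli, target="neutral", num_sentences=1)` rejects the first two candidates and accepts the neutral one. In a real setting, every rejected candidate costs an extra language-model and NLI forward pass, which is consistent with the computational cost noted for the approach.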
