Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference

Eric Mitchell, Joseph J. Noh, Siyan Li, William S. Armstrong, Ananth Agarwal, Patrick Liu, Chelsea Finn, Christopher D. Manning

TL;DR

ConCoRD addresses the lack of internal self-consistency in large pre-trained language models by dynamically estimating logical relations between outputs with a pre-trained NLI model and re-ranking candidate answers via a factor-graph MaxSAT inference, all without fine-tuning. The framework combines a base model that proposes multiple outputs with an NLI-driven relation model, augmented by entailment-correction and test-time information injection, to improve QA and VQA performance across BeliefBank, ConVQA, and Natural Questions settings. Key findings show robust improvements in F1 and accuracy, as well as increased consistency, with gains persisting across model sizes and tasks; the approach remains scalable due to off-the-shelf components and efficient MaxSAT solving. The work highlights practical implications for deploying more reliable NLP systems and suggests future work in end-to-end differentiable integration and cross-domain extensions beyond natural language tasks.

Abstract

While large pre-trained language models are powerful, their predictions often lack logical consistency across test inputs. For example, a state-of-the-art Macaw question-answering (QA) model answers 'Yes' to 'Is a sparrow a bird?' and 'Does a bird have feet?' but answers 'No' to 'Does a sparrow have feet?'. To address this failure mode, we propose a framework, Consistency Correction through Relation Detection, or ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models using pre-trained natural language inference (NLI) models without fine-tuning or re-training. Given a batch of test inputs, ConCoRD samples several candidate outputs for each input and instantiates a factor graph that accounts for both the model's belief about the likelihood of each answer choice in isolation and the NLI model's beliefs about pair-wise answer choice compatibility. We show that a weighted MaxSAT solver can efficiently compute high-quality answer choices under this factor graph, improving over the raw model's predictions. Our experiments demonstrate that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models using off-the-shelf NLI models, notably increasing accuracy of LXMERT on ConVQA by 5% absolute. See https://ericmitchell.ai/emnlp-2022-concord/ for code and data.
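The re-ranking idea in the abstract can be sketched on the sparrow example. This is a minimal illustration, not the authors' implementation: the confidence values, the single entailment relation, the function names (`rerank`, `relation_satisfied`), and the exact form of the score are all assumptions made here for clarity, and exhaustive enumeration stands in for the weighted MaxSAT solver the paper uses. The parameters `beta` and `lam` play the roles described in the paper's Figure 3 caption (tradeoff between base-model and relation-model beliefs, and NLI confidence threshold).

```python
from itertools import product


def relation_satisfied(kind, premise_true, hypothesis_true):
    """Return True if a truth assignment respects one pairwise relation."""
    if kind == "entailment":
        # Violated only when the premise holds and the hypothesis does not.
        return (not premise_true) or hypothesis_true
    if kind == "contradiction":
        # Violated only when both statements are held true.
        return not (premise_true and hypothesis_true)
    raise ValueError(f"unknown relation kind: {kind}")


def rerank(probs, relations, beta=0.5, lam=0.5):
    """Pick the truth assignment with the highest combined score.

    probs[i]  : assumed base-model confidence that statement i is true.
    relations : (i, j, kind, confidence) tuples, as a hypothetical NLI
                model might emit them.
    beta      : weight on relation-model beliefs vs. base-model beliefs.
    lam       : relations with confidence below this are discarded.
    """
    kept = [r for r in relations if r[3] >= lam]

    def score(z):
        # Base-model term: confidence assigned to each chosen truth value.
        unary = sum(p if v else 1.0 - p for p, v in zip(probs, z))
        # Relation term: total confidence of the satisfied soft constraints.
        pairwise = sum(
            c for i, j, kind, c in kept if relation_satisfied(kind, z[i], z[j])
        )
        return (1.0 - beta) * unary + beta * pairwise

    # Brute force over all 2^n assignments; a MaxSAT solver replaces this
    # step at realistic batch sizes.
    return max(product([True, False], repeat=len(probs)), key=score)


# Statements: 0 = "A sparrow is a bird", 1 = "A bird has feet",
#             2 = "A sparrow has feet". All numbers are made up.
probs = [0.95, 0.90, 0.40]
relations = [(1, 2, "entailment", 0.8)]

base_only = rerank(probs, relations, beta=0.0)
corrected = rerank(probs, relations, beta=0.5)
print(base_only, corrected)
```

With `beta=0.0` the relation term is ignored and the inconsistent "No" for the third question survives; with `beta=0.5` the assumed entailment outweighs the base model's weak preference and the third answer flips to "Yes".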

Paper Structure

This paper contains 32 sections, 4 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: ConCoRD first generates candidate outputs from the base pre-trained model, then estimates soft pairwise constraints between output choices, and finally finds the most satisfactory choices of answers accounting for both the base model and NLI model's beliefs.
  • Figure 2: An example factor graph for a simplified batch with two questions, $q_1$ = "What is the capital of Afghanistan?" and $q_2$ = "What is the capital of Georgia?". Although Tbilisi is the most likely answer for both questions, the assignment of variables that is best under the estimated contradiction constraint flips the answer to the first question to Kabul. The top-2 answer choices for each question are sampled from the base model, and a soft contradiction constraint is detected between variables $z_1$ (representing the truth of the answer Tbilisi for $q_1$) and $z_3$ (representing the truth of the answer Tbilisi for $q_2$).
  • Figure 3: Change in ConCoRD's exact-match validation accuracy as $\lambda$ (the NLI confidence threshold) and $\beta$ (tradeoff between base model and relation model beliefs) vary, holding the relation model (RoBERTa-Large ANLI) constant. By comparing the maximum value within each column or row, we conclude that ConCoRD is relatively robust to the choice of $\lambda$, while the choice of $\beta$ is more important. Values are those encountered during tuning with base model ViLT on ConVQA validation questions. Gray squares correspond to regions not evaluated during search, and asterisks (***) mark the region where the maximum increase in accuracy occurs.
  • Figure 4: "Good" flip examples from the VQA experiments. The green texts mark the correctly selected answers, while the red texts indicate incorrectly selected answers.
  • Figure 5: "Bad" flip examples from the VQA experiments. The green texts mark the correctly selected answers, while the red texts indicate the incorrectly selected answers. The bolded texts are the correct answers, if generated within the top-2 predictions. From top to bottom, the first image is an example where the correct answer, "sheet," was not among the predicted answers. The second image is an example where the conversion of the QA pair to a statement did not occur as intended, so the NLI model failed to generate the inferences needed to correct "background" to "buildings." The third image shows an example where an "incorrect" answer (sky) is effectively the same as the "correct" answer (in sky), only semantically different. The fourth image shows an example where the model strongly believed in an incorrect answer and changed another correct answer.
  • ...and 1 more figures
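
The answer flip described in the Figure 2 caption can be checked with a small worked calculation. Everything numeric here is hypothetical: the base-model confidences, the contradiction confidence $c$, the value of $\beta$, and the simplified score $S$ itself are assumptions chosen for illustration and are not taken from the paper's equations. Let $a'$ denote the second sampled candidate for $q_2$, which the caption does not name.

$$
S(a_1, a_2) = (1-\beta)\,\big[p(a_1 \mid q_1) + p(a_2 \mid q_2)\big] + \beta\, c \cdot \mathbb{1}\big[\lnot (z_1 \land z_3)\big]
$$

Assume $p(\text{Tbilisi} \mid q_1) = 0.55$, $p(\text{Kabul} \mid q_1) = 0.45$, $p(\text{Tbilisi} \mid q_2) = 0.90$, $p(a' \mid q_2) = 0.10$, $c = 0.9$, and $\beta = 0.5$. Then:

$$
\begin{aligned}
S(\text{Tbilisi}, \text{Tbilisi}) &= 0.5 \cdot 1.45 + 0.5 \cdot 0.9 \cdot 0 = 0.725 \\
S(\text{Kabul}, \text{Tbilisi}) &= 0.5 \cdot 1.35 + 0.5 \cdot 0.9 \cdot 1 = 1.125 \\
S(\text{Tbilisi}, a') &= 0.5 \cdot 0.65 + 0.5 \cdot 0.9 \cdot 1 = 0.775 \\
S(\text{Kabul}, a') &= 0.5 \cdot 0.55 + 0.5 \cdot 0.9 \cdot 1 = 0.725
\end{aligned}
$$

Under these assumed values, the base model alone would answer Tbilisi to both questions, but the contradiction constraint makes (Kabul, Tbilisi) the highest-scoring joint assignment, matching the flip the caption describes.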