Table of Contents
Fetching ...

Ask Again, Then Fail: Large Language Models' Vacillations in Judgment

Qiming Xie, Zengzhi Wang, Yi Feng, Rui Xia

TL;DR

The paper identifies a pervasive judgment consistency issue in state-of-the-art LLMs: models often revise correct answers when confronted with follow-up questioning. It proposes a Follow-up Questioning Mechanism and two metrics to quantify this wavering, then demonstrates the universality of the problem across multiple models and domains. To mitigate the issue, it offers training-free prompting strategies and a training-based framework, Unwavering-FQ, which uses polarized preference context distillation and direct preference optimization to preserve initial correct judgments while maintaining overall conversational ability (as reflected in MT-Bench). Empirical results show meaningful improvements in judgment consistency and general capabilities, with data and prompts released to support future research. The work advances evaluation paradigms for LLM reliability and provides practical mitigation paths for both closed-source and open-source models.

Abstract

We observe that current conversational language models often waver in their judgments when faced with follow-up questions, even if the original judgment was correct. This wavering presents a significant challenge for generating reliable responses and building user trust. To comprehensively assess this issue, we introduce a \textsc{Follow-up Questioning Mechanism} along with two metrics to quantify this inconsistency, confirming its widespread presence in current language models. To mitigate this issue, we explore various prompting strategies for closed-source models; moreover, we develop a training-based framework \textsc{Unwavering-FQ} that teaches language models to maintain their originally correct judgments through synthesized high-quality preference data. Our experimental results confirm the effectiveness of our framework and its ability to enhance the general capabilities of models.

Ask Again, Then Fail: Large Language Models' Vacillations in Judgment

TL;DR

The paper identifies a pervasive judgment consistency issue in state-of-the-art LLMs: models often revise correct answers when confronted with follow-up questioning. It proposes a Follow-up Questioning Mechanism and two metrics to quantify this wavering, then demonstrates the universality of the problem across multiple models and domains. To mitigate the issue, it offers training-free prompting strategies and a training-based framework, Unwavering-FQ, which uses polarized preference context distillation and direct preference optimization to preserve initial correct judgments while maintaining overall conversational ability (as reflected in MT-Bench). Empirical results show meaningful improvements in judgment consistency and general capabilities, with data and prompts released to support future research. The work advances evaluation paradigms for LLM reliability and provides practical mitigation paths for both closed-source and open-source models.

Abstract

We observe that current conversational language models often waver in their judgments when faced with follow-up questions, even if the original judgment was correct. This wavering presents a significant challenge for generating reliable responses and building user trust. To comprehensively assess this issue, we introduce a \textsc{Follow-up Questioning Mechanism} along with two metrics to quantify this inconsistency, confirming its widespread presence in current language models. To mitigate this issue, we explore various prompting strategies for closed-source models; moreover, we develop a training-based framework \textsc{Unwavering-FQ} that teaches language models to maintain their originally correct judgments through synthesized high-quality preference data. Our experimental results confirm the effectiveness of our framework and its ability to enhance the general capabilities of models.
Paper Structure (35 sections, 4 equations, 7 figures, 31 tables)

This paper contains 35 sections, 4 equations, 7 figures, 31 tables.

Figures (7)

  • Figure 1: In the teaching process, teachers often question or mislead students based on their answers to ensure genuine understanding.
  • Figure 2: Two forms of the Follow-up Questioning Mechanism. We design three types of questions for follow-up questioning. The Direct Form involves selecting one type of question from the three types to continue the inquiry, while the Progressive Form involves sequentially using the all types of questions for further inquiry.
  • Figure 3: The results of ChatGPT in Direct Form. C, O, and L represent closed-ended, open-ended, and leading questions, respectively. Full results are in Appendix \ref{['sec:appendix-chatgpt-results']}.
  • Figure 4: The results of ChatGPT in Progressive Form. Full results are in Appendix \ref{['sec:appendix-chatgpt-results']}.
  • Figure 5: The impact of different prompts on Modification (Direct Form). Colors denote datasets, and each dataset's three circles reflect results using prompts A, B, and C from Table \ref{['tab:prompt-all']}. See the Appendix \ref{['sec:appendix-chatgpt-results']}, \ref{['sec:appendix-palm2-results']} and \ref{['sec:appendix-vicuna-results']} for full results.
  • ...and 2 more figures