Table of Contents
Fetching ...

From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

Parisa Rabbani, Nimet Beyza Bozdag, Dilek Hakkani-Tür

TL;DR

This work introduces the Conversational Judgment Task (CJT) to test how framing a task from a factual query to a two-turn dialogue affects LLMs’ judgments of a speaker. Using TruthfulQA as a controlled ground truth, it shows that CJT induces significant, model-specific shifts in initial judgments and leaves models broadly vulnerable to persuasive pressure, with some models becoming sycophantic and others overly critical. The study combines quantitative accuracy with McNemar-based significance tests and qualitative analysis to reveal systematic weaknesses in LLM-based judging when social context is present. It provides a reproducible methodology to diagnose and mitigate social susceptibility in dialogue-based evaluation systems, underscoring the need for robust alignment when LLMs act as impartial judges. Overall, the findings highlight that conversational framing—not just factual knowledge—critically shapes the reliability of LLM judges in social arbitration tasks.

Abstract

LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.

From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

TL;DR

This work introduces the Conversational Judgment Task (CJT) to test how framing a task from a factual query to a two-turn dialogue affects LLMs’ judgments of a speaker. Using TruthfulQA as a controlled ground truth, it shows that CJT induces significant, model-specific shifts in initial judgments and leaves models broadly vulnerable to persuasive pressure, with some models becoming sycophantic and others overly critical. The study combines quantitative accuracy with McNemar-based significance tests and qualitative analysis to reveal systematic weaknesses in LLM-based judging when social context is present. It provides a reproducible methodology to diagnose and mitigate social susceptibility in dialogue-based evaluation systems, underscoring the need for robust alignment when LLMs act as impartial judges. Overall, the findings highlight that conversational framing—not just factual knowledge—critically shapes the reliability of LLM judges in social arbitration tasks.

Abstract

LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM's conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model's performance on direct factual queries with its assessment of a speaker's correctness when the same information is presented within a minimal dialogue, effectively shifting the query from "Is this statement correct?" to "Is this speaker correct?". Furthermore, we apply pressure in the form of a simple rebuttal ("The previous answer is incorrect.") to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.

Paper Structure

This paper contains 19 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The impact of task framing on LLM judgment. In a direct Factual Inquiry (top), the model provides a correct response. When the same misconception is reframed as a Conversational Judgment Task (bottom), the model's judgment flips, leading to an unsafe, incorrect response.
  • Figure 2: The impact of simple rebuttal pressure on LLM's accuracy. The model changes its answer under minimal pressure.
  • Figure 3: Impact of Rebuttal Pressure on LLM Accuracy across Task Frames. The plots show the accuracy for GPT-4o Mini, Mistral Small 3, Gemma 3 12B, Llama 3.1 8B Instruct and Llama 3.2 3B Instruct before ('Initial') and after ('Post Pressure') a simple rebuttal.
  • Figure 4: Prompt for direct factual query.
  • Figure 5: Prompt for CJT.