Judgment-of-Thought Prompting: A Courtroom-Inspired Framework for Binary Logical Reasoning with Large Language Models

Sungjune Park; Heehwan Kim; Haehyun Cho; Daeseon Choi

TL;DR

JoT targets binary logical reasoning in LLMs with a courtroom-inspired prompting framework built on three roles (lawyer, prosecutor, judge) that engage in an adversarial yet structured debate with iterative refinement. A high-level model acting as judge evaluates the arguments produced by lower-level models acting as lawyer and prosecutor, which improves accuracy, consistency, and interpretability across diverse tasks. On BigBenchHard and Winogrande, JoT reports strong results (e.g., 98% on Boolean Expressions, 90% on Web of Lies, 89% on Winogrande), and ablations confirm the necessity of each role and of the iterative feedback loop. The work suggests JoT's potential for reliable decision-making in real-world domains; future directions include domain-specific retrieval augmentation and efficiency improvements.

Abstract

This paper proposes a novel prompting approach, Judgment of Thought (JoT), specifically tailored for binary logical reasoning tasks. Despite advances in prompt engineering, existing approaches still face limitations in handling complex logical reasoning tasks. To address these issues, JoT introduces a multi-agent approach with three specialized roles (lawyer, prosecutor, and judge), where a high-level model acts as the judge, and lower-level models serve as lawyer and prosecutor to systematically debate and evaluate arguments. Experimental evaluations on benchmarks such as BigBenchHard and Winogrande demonstrate JoT's superior performance compared to existing prompting approaches, achieving notable improvements, including 98% accuracy in Boolean expressions. Also, our ablation studies validate the critical contribution of each role, iterative refinement loops, and feedback mechanisms. Consequently, JoT significantly enhances accuracy, reliability, and consistency in binary reasoning tasks and shows potential for practical applications.
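
As a rough illustration of the debate loop the abstract describes, the Python sketch below wires three roles together: two advocates argue opposite sides of a true/false question and a judge either rules or sends feedback for another round. Everything concrete here is an assumption made for illustration and is not taken from the paper: the names `judgment_of_thought`, `parse_verdict`, and `Verdict`, the prompt wording, the `TRUE|FALSE|UNDECIDED: <feedback>` reply format, the choice that the lawyer argues TRUE and the prosecutor FALSE, and the `max_rounds` limit. Models are represented as plain callables from prompt text to reply text, so no particular LLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: a "model" is any callable mapping a prompt string to a reply string.
Model = Callable[[str], str]


@dataclass
class Verdict:
    answer: Optional[bool]  # True/False once decided, None if another round is requested
    feedback: str           # judge's notes, passed back to both advocates


def parse_verdict(reply: str) -> Verdict:
    """Parse a hypothetical reply format: 'TRUE|FALSE|UNDECIDED: <feedback>'."""
    head, _, rest = reply.partition(":")
    answer = {"TRUE": True, "FALSE": False}.get(head.strip().upper())
    return Verdict(answer, rest.strip())


def judgment_of_thought(
    question: str,
    lawyer: Model,       # lower-level model, argues the answer is TRUE (assumed)
    prosecutor: Model,   # lower-level model, argues the answer is FALSE (assumed)
    judge: Model,        # higher-level model, rules or asks for refinement
    max_rounds: int = 3,
) -> bool:
    feedback = ""
    for _ in range(max_rounds):
        case_for = lawyer(
            f"Question: {question}\nArgue that the answer is TRUE.\n"
            f"Judge feedback so far: {feedback}"
        )
        case_against = prosecutor(
            f"Question: {question}\nArgue that the answer is FALSE.\n"
            f"Judge feedback so far: {feedback}"
        )
        verdict = parse_verdict(
            judge(
                f"Question: {question}\nLawyer: {case_for}\n"
                f"Prosecutor: {case_against}\n"
                "Reply as 'TRUE: ...', 'FALSE: ...', or 'UNDECIDED: ...'."
            )
        )
        if verdict.answer is not None:
            return verdict.answer
        # No ruling yet: feed the judge's notes into the next round of arguments.
        feedback = verdict.feedback
    raise RuntimeError("judge did not reach a verdict within max_rounds")
```

Raising an error when no verdict is reached is one possible design choice; a caller could instead force a final ruling or fall back to a default answer.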

Paper Structure (12 sections, 5 figures, 6 tables)

Introduction
Background
Judgment of Thought (JoT)
Evaluation
Evaluation Setup
Evaluation Result on Benchmarks
Ablation Study on JoT
Discussion
Conclusion
Limitation
Used prompts for JoT
Resampling Results: Comparison of the Existing Prompt Engineering techniques and JoT

Figures (5)

Figure 1: Comparison of Judgment of Thought (ours) with recent prompting strategies.
Figure 2: Judgment of Thought (JoT) Architecture. It consists of three roles: lawyer, prosecutor, and judge. The lawyer and prosecutor use lower-level models to argue different aspects of a problem. The judge uses a higher-level model to evaluate these arguments and deliver a comprehensive judgment. This process enables thorough analysis from multiple perspectives, leading to balanced solutions for complex problems.
Figure 3: Case studies highlighting how JoT resolves binary reasoning tasks through adversarial dialogue.
Figure 4: Comparative illustration of the reasoning paradigms in CoT, Debate (Khan et al.), and the proposed Judgment of Thought (ours) frameworks.
Figure 5: Boxplots illustrating the resampling results, comparing the variability and robustness of existing prompt engineering techniques and JoT. Self-Consistency was excluded from this comparison due to its reliance on repeated executions, which incur substantial computational costs. For a detailed comparison of trends between Self-Consistency and other methods, please refer to Table 1.
