Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction

Vivaan Sandwar, Bhav Jain, Rishan Thangaraj, Ishaan Garg, Michael Lam, Kevin Zhu

TL;DR

The paper introduces Town Hall-Style Debate Prompting (THDP), a single-LLM prompting framework that assigns multiple expert personas to engage in a structured debate and vote to determine the final answer, aiming to broaden the reasoning space and reduce errors in complex tasks. THDP is evaluated on the ZebraLogic benchmark, comparing against 1-shot Chain-of-Thought (CoT) prompts across MCQ and ZebraGrid tasks with GPT-4o, GPT-4o Mini, and Claude 3.5 Sonnet. Results show that THDP, particularly with around five personas, yields notable improvements in cell and puzzle accuracies, with larger models gaining more from the approach. The work demonstrates THDP as a scalable method to amplify reasoning capabilities without external agents or retrieval, albeit at the cost of higher token use and potential tangles in smaller models.

Abstract

Debate is a common form of human communication for problem-solving because of its efficiency. Debate fundamentally allows multiple viewpoints to be raised, and for complex problems each viewpoint opens a new path to a solution. In this work, we apply this concept to LLM decision-making by proposing town hall-style debate prompting (THDP), a prompting method that splits a language model into multiple personas that debate one another to reach a conclusion. Our experimental pipeline varies both the number of personas and the personality type of each persona to find the optimal town hall size and personality mix for benchmark performance as measured by ZebraLogic, a reasoning-intensive benchmark comprising both multiple-choice and fill-in-the-blank questions. Our experimental results demonstrate that a town hall of 5 personas with LLM-determined personality types performs best on ZebraLogic, achieving a 13% improvement over one-shot CoT baselines in per-cell accuracy with GPT-4o, a 9% increase in puzzle accuracy with Claude 3.5 Sonnet, and an improvement in hard puzzle accuracy from 10-15%.
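
The debate-then-vote idea can be illustrated with a minimal sketch. The exact prompt wording used in the paper is not reproduced here, so the prompt text, the `VOTE:` line format, and the helper names `build_thdp_prompt` and `tally_votes` are all assumptions made for illustration. No model is called; the sketch only builds a single prompt and tallies votes from a model's text output.

```python
from collections import Counter
from typing import Optional


def build_thdp_prompt(question: str, n_personas: int = 5) -> str:
    """Build one prompt asking a single model to role-play a town hall.

    The wording is hypothetical; it is not the prompt from the paper.
    """
    return (
        f"Create {n_personas} expert personas suited to the problem below.\n"
        "Have the personas debate the problem step by step, "
        "challenging each other's reasoning.\n"
        "Then have each persona cast one vote on its own line, "
        "formatted as 'VOTE: <answer>'.\n\n"
        f"Problem: {question}"
    )


def tally_votes(model_output: str) -> Optional[str]:
    """Return the most common answer among lines starting with 'VOTE:'.

    Returns None when the output contains no vote lines.
    """
    votes = [
        line.split(":", 1)[1].strip()
        for line in model_output.splitlines()
        if line.strip().upper().startswith("VOTE:")
    ]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]
```

In this sketch the majority answer is taken as final, and a reply with no vote lines yields `None`, which a caller could treat as a blank answer.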

Paper Structure

This paper contains 25 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Schematic illustration of Townhall-Style Debate Prompting (THDP) and the difference compared to previous prompting methods.
  • Figure 2: Model performance across different persona counts. A persona count of 5 generally performs best.
  • Figure 3: THDP demonstrates better results across the board on MCQ-style problems. Higher correct is better; lower incorrect and blank are better.
  • Figure 4: Visual depiction of a grid benchmark task, in this case from the ZebraLogic benchmark.
  • Figure 5: THDP shows weaker results on smaller models such as GPT-4o-Mini. Higher Cell, Easy, Hard, and Puzzle accuracies are better. A lower blank is better.
  • ...and 1 more figures