Self-Correcting Large Language Models: Generation vs. Multiple Choice

Hossein A. Rahmani, Satyapriya Krishna, Xi Wang, Mohammadmehdi Naghiaei, Emine Yilmaz

TL;DR

This work systematically compares iterative self-correction in large language models for open-ended generation versus multiple-choice answer selection across knowledge-intensive and reasoning tasks. It shows that open-ended generation provides rapid early improvements through reinterpretation but is prone to semantic drift in later rounds, whereas multiple-choice corrections are more stable yet limited by the fixed option space and logit inertia. Model scale and prompting strategies yield modest, task-dependent gains, with larger models helping primarily on harder tasks and reasoning prompts aiding difficult questions but offering limited universal benefits. The findings underscore an adaptability–stability trade-off and motivate hybrid approaches that combine exploratory generation with constrained verification and dynamic stopping to harness the strengths of both paradigms.

Abstract

Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement, often referred to as self-consistency or self-reflection. However, the dynamics of this self-correction mechanism may differ substantially depending on whether the model is tasked with open-ended text generation or with selecting the most appropriate response from multiple predefined options. In this paper, we conduct a systematic investigation of these two paradigms by comparing performance trends and error-correction behaviors across various natural language understanding and reasoning tasks, covering language models of different scales and families. Our experimental results reveal distinct patterns of improvement and failure modes: while open-ended generation often benefits from the flexibility of re-interpretation and compositional refinement, multiple-choice selection can leverage clearer solution boundaries but may be limited by the provided options. This contrast also reflects the dual demands faced by emerging agentic LLM applications: effective agents must not only generate and refine open-ended plans or explanations, but also make reliable discrete choices when operating within constrained action spaces. Our findings, therefore, highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space, with implications for both knowledge-intensive reasoning and decision-oriented applications of LLMs.
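
The error-correction behaviors compared in the abstract can be made concrete by tracking, round by round, how many answers change from incorrect to correct and how many change the other way. The following is a minimal sketch in Python, not the authors' evaluation code. It assumes that a "correct flip" means an answer that was wrong in one round and right in the next, and an "incorrect flip" the reverse. The function names `count_flips` and `accuracy_per_round` and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def count_flips(correct_by_round: List[List[bool]]) -> List[Tuple[int, int]]:
    """Count (correct_flips, incorrect_flips) between consecutive rounds.

    correct_by_round[t][i] is True when question i is answered correctly
    at self-correction round t. This flip definition is an assumption.
    """
    flips = []
    for prev, curr in zip(correct_by_round, correct_by_round[1:]):
        # Wrong in the previous round, right in the current one.
        correct_flips = sum(1 for p, c in zip(prev, curr) if not p and c)
        # Right in the previous round, wrong in the current one.
        incorrect_flips = sum(1 for p, c in zip(prev, curr) if p and not c)
        flips.append((correct_flips, incorrect_flips))
    return flips


def accuracy_per_round(correct_by_round: List[List[bool]]) -> List[float]:
    """Fraction of questions answered correctly at each round."""
    return [sum(r) / len(r) for r in correct_by_round]


# Toy, made-up correctness records: 3 rounds, 4 questions.
rounds = [
    [False, True, False, True],
    [True, True, False, True],
    [True, False, False, True],
]
flips = count_flips(rounds)
accs = accuracy_per_round(rounds)
print(flips)
print(accs)
```

In this toy record, accuracy rises after the first correction round and falls back after the second, because one answer is fixed and then a different one is broken. A rule that stops iterating once incorrect flips begin to outnumber correct flips would be one simple form of the dynamic stopping the TL;DR mentions, although the source does not specify such a rule.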

Paper Structure

This paper contains 32 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Average cumulative accuracy on generation and multiple-choice. (Top) On the DisambiguationQA dataset, models perform better on the multiple-choice task under iterative self-correction; (bottom) on the tinyTruthfulQA dataset, models perform better on the generation task.
  • Figure 2: Average Correct and Incorrect Flips on DisambiguationQA
  • Figure 3: Average Correct and Incorrect Flips on tinyTruthfulQA
  • Figure 4: Accuracy per iteration per model on generation and multiple-choice.
  • Figure 5: Cumulative accuracy (after the final self-correction iteration) using different models on (top) DisambiguationQA and (bottom) tinyTruthfulQA. The results indicate that self-correction behaves very differently for generation and multiple-choice questions, depending on the dataset.
  • ...and 6 more figures