CLEAR: Contrasting Textual Feedback with Experts and Amateurs for Reasoning

Andrew Rufail, Daniel Kim, Sean O'Brien, Kevin Zhu

TL;DR

CLEAR presents a contrastive-feedback framework that pairs a large expert LM with a smaller amateur LM to critique outputs, contrasts their feedback, and uses a lightweight feedback refinement loop to improve reasoning. It introduces Node Evaluator and Feedback Filter modules and a BeClear best-first search to efficiently reach high-quality solutions, outperforming several prompting-based and tree-based baselines across constrained generation, story outlining, mathematical reasoning, and toxicity mitigation. The results demonstrate strong gains with modest iteration depth ($d\leq 3$) and good generalization to different model families, while maintaining computational efficiency. This approach offers a practical, scalable path to improving reasoning in LLMs and could extend to bias reduction, safety, and other decision-making tasks in real-world applications.

Abstract

We introduce CLEAR (Contrasting Textual Feedback with Experts and Amateurs for Reasoning), a novel approach to language model reasoning that leverages the strengths of a larger (expert) model and a smaller (amateur) model. The expert and amateur models each provide feedback on a model's initial output, and the two sets of feedback are contrasted with each other to produce refined feedback. This refined feedback is then applied to iteratively improve CLEAR's responses. Our experiments demonstrate that CLEAR outperforms state-of-the-art methods on several challenging reasoning tasks, including story outline improvement (up to 19.6% relative increase in interestingness), constrained generation (up to 18.5% increase in coverage), mathematical reasoning (up to 6.7% improvement in accuracy), and toxicity mitigation (up to 22% decrease in toxicity).

Paper Structure

This paper contains 25 sections, 4 equations, 5 figures, 8 tables, 2 algorithms.

Figures (5)

  • Figure 1: This diagram demonstrates the two variants of CLEAR and shows how the best-first search is leveraged to improve only the most promising nodes.
  • Figure 2: Cosine similarity analysis between different feedback types using text-embedding-3-small. These results are the average similarities aggregated across 200 data points in the GSM8K and CommonGen-Hard experiments. The bar graph shows that the expert and amateur model feedback are semantically different, and that the filtered feedback also contains different content. Furthermore, if the same model is used for both roles (red bars), the filtered feedback does not contain significantly different content, which worsens CLEAR's performance.
  • Figure 3: Different heuristics for the best-first search were tested on GSM8K with $d=5$ (see Section 3.3). Expert weighted: $100 - |1.5\,v(\text{expert}) - v(\text{amateur})|$, Equal weighting: $100 - |v(\text{expert}) - v(\text{amateur})|$, Expert only: $100 - v(\text{expert})$, Amateur only: $100 - v(\text{amateur})$.
  • Figure 4: This diagram demonstrates how the expert and amateur models' feedback is processed in constrained generation.
  • Figure 5: Further data on CLEAR's performance on CommonGen-Hard, showing how $d$ affects concept coverage and sentence relevance.
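
The four best-first search heuristics listed in the Figure 3 caption can be written out directly. The following is a minimal sketch that only transcribes those formulas; the function names, the `select_best` helper, and the assumption that a higher heuristic value marks a more promising node are illustrative and not taken from the paper. The precise meaning of $v(\cdot)$ is defined in Section 3.3 and is not reproduced here.

```python
from typing import Callable, Dict, List, Tuple

# v_expert and v_amateur stand for v(expert) and v(amateur) in the
# Figure 3 caption; their exact definition is given in Section 3.3 of the paper.
Heuristic = Callable[[float, float], float]


def expert_weighted(v_expert: float, v_amateur: float) -> float:
    """100 - |1.5 * v(expert) - v(amateur)|"""
    return 100 - abs(1.5 * v_expert - v_amateur)


def equal_weighting(v_expert: float, v_amateur: float) -> float:
    """100 - |v(expert) - v(amateur)|"""
    return 100 - abs(v_expert - v_amateur)


def expert_only(v_expert: float, v_amateur: float) -> float:
    """100 - v(expert); the amateur value is ignored."""
    return 100 - v_expert


def amateur_only(v_expert: float, v_amateur: float) -> float:
    """100 - v(amateur); the expert value is ignored."""
    return 100 - v_amateur


HEURISTICS: Dict[str, Heuristic] = {
    "expert_weighted": expert_weighted,
    "equal_weighting": equal_weighting,
    "expert_only": expert_only,
    "amateur_only": amateur_only,
}


def select_best(nodes: List[Tuple[str, float, float]], heuristic: Heuristic) -> str:
    """Return the label of the node with the highest heuristic value.

    Assumption (not from the paper): higher heuristic value = more promising.
    Each node is a hypothetical (label, v_expert, v_amateur) tuple.
    """
    return max(nodes, key=lambda n: heuristic(n[1], n[2]))[0]


if __name__ == "__main__":
    candidates = [("a", 40.0, 50.0), ("b", 20.0, 80.0)]
    for name, h in HEURISTICS.items():
        print(name, select_best(candidates, h))
```

Keeping every heuristic behind the same two-argument signature makes it straightforward to swap one for another when comparing them, as the Figure 3 experiment does.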