Beyond Introspection: Reinforcing Thinking via Externalist Behavioral Feedback

Diji Yang, Linda Zeng, Kezhen Chen, Yi Zhang

TL;DR

The paper tackles unreliability in inference-time reasoning by removing reliance on internal model states and introspection. It introduces DRR, an externalist three-step framework that distills LLM reasoning into behavioral traces, trains a lightweight discriminative model (DM) to critique those traces, and uses the DM to iteratively guide the LLM toward a reliable conclusion or, when none is accepted, toward abstention. DRR is model-agnostic, relies on automated data distillation, and shows strong gains for both open-source and closed-source LLMs across commonsense and knowledge-intensive benchmarks, outperforming self-critique baselines. The approach offers a scalable, low-cost way to improve reasoning reliability without fine-tuning the base model, which makes it relevant to deployment in safety- and reliability-critical tasks.

Abstract

While inference-time thinking allows Large Language Models (LLMs) to address complex problems, the extended thinking process can be unreliable or inconsistent because of the model's probabilistic nature, especially near its knowledge boundaries. Existing approaches attempt to mitigate this by having the model critique its own reasoning to make corrections. However, such self-critique inherits the same biases as the original output, a phenomenon known as the introspection illusion. Moving beyond such introspection and inspired by core methodologies in ethology, we propose an externalist three-step framework, Distillation-Reinforcement-Reasoning (DRR). Rather than relying on a model's introspection, DRR evaluates its observable behaviors to provide corrective feedback. DRR first distills the reasoner's behavioral traces, then trains a lightweight, external Discriminative Model (DM). At inference time, this DM acts as a critic, identifying and rejecting suspicious reasoning steps. This external feedback compels the LLM to discard flawed pathways and explore alternatives, thereby enhancing reasoning quality without altering the base model. Experiments on multiple reasoning benchmarks show that our framework significantly outperforms prominent self-critique methods. Benefiting from a lightweight and annotation-free design, DRR offers a scalable and adaptable solution for improving the reliability of reasoning in a wide range of LLMs.
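
The inference-time step described in the abstract can be pictured as a small accept/reject loop. The following is a minimal sketch, not the authors' implementation: the `Trace` type, the `Reasoner` and `Critic` signatures, the `max_turns` budget, and the toy stand-ins are all hypothetical names introduced here for illustration. It assumes the critic returns a binary accept/reject verdict and that the reasoner is shown its previously rejected attempts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trace:
    """One observable reasoning attempt: the rationale text and the final answer."""
    rationale: str
    answer: str


# Hypothetical interfaces (assumptions, not taken from the paper):
# the reasoner sees the question plus the attempts rejected so far,
# and the critic returns True to accept a trace or False to reject it.
Reasoner = Callable[[str, List[Trace]], Trace]
Critic = Callable[[str, Trace], bool]


def reason_with_external_critic(
    question: str,
    reasoner: Reasoner,
    critic: Critic,
    max_turns: int = 3,
) -> Optional[str]:
    """Return the first answer the external critic accepts, or None to abstain."""
    rejected: List[Trace] = []
    for _ in range(max_turns):
        trace = reasoner(question, rejected)
        if critic(question, trace):
            return trace.answer
        # A rejection is fed back so the reasoner can explore an alternative path.
        rejected.append(trace)
    # Every attempt was rejected within the budget: abstain.
    return None


# Toy stand-ins so the sketch runs without any model.
def toy_reasoner(question: str, rejected: List[Trace]) -> Trace:
    candidates = ["A", "B", "C"]
    choice = candidates[len(rejected) % len(candidates)]
    return Trace(rationale=f"I pick {choice}", answer=choice)


def toy_critic(question: str, trace: Trace) -> bool:
    return trace.answer == "B"


if __name__ == "__main__":
    print(reason_with_external_critic("Which option?", toy_reasoner, toy_critic))
    print(reason_with_external_critic("Which option?", toy_reasoner, toy_critic, max_turns=1))
```

In a real setting the toy functions would be replaced by calls to the base LLM and to the trained DM. The loop itself never touches the base model's weights, which matches the abstract's claim that reasoning quality improves without altering the base model.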

Paper Structure

This paper contains 33 sections, 1 equation, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the three-step Distillation-Reinforcement-Reasoning (DRR) pipeline.
  • Figure 2: Critic-decision accuracy Acc(D) for the Abstain, Self-Critic, and DRR settings, with Llama3 and GPT-4 as the Reasoner.
  • Figure 3: Examples from CommonsenseQA and OpenBookQA with DRR. Left: successful correction and acceptance; Center: continued rejection leading to abstention; Right: false rejection followed by successful acceptance.