Reasoning Elicitation in Language Models via Counterfactual Feedback

Alihan Hüyük; Xinnuo Xu; Jacqueline Maasch; Aditya V. Nori; Javier González

TL;DR

This work targets the gap between recall and reasoning in large language models by framing causal reasoning as counterfactual reasoning and proposing a formal interface between a world model and a language model. It introduces the necessity and sufficiency inconsistency rates (N-IR, S-IR), alongside the probability of necessity (PN) and the probability of sufficiency (PS), to capture unit-level causal consistency beyond simple accuracy. The authors propose three fine-tuning schemes driven by counterfactual feedback (supervised CF feedback, preference-based CF feedback, and preference-based causal consistency feedback) and demonstrate that combining factual and counterfactual data, especially with causal-consistency guidance, improves inductive generalization of reasoning across in-domain and real-world tasks. The results suggest that carefully paired factual-counterfactual demonstrations and consistency-focused training improve higher-level reasoning (necessity and sufficiency) over baseline recall and generalize better under inductive transfer, while also exposing limits in certain deduction scenarios and under the binary-variable assumption. Overall, the work advances reasoning elicitation in LLMs and provides a data-centric pathway to train models that reason with counterfactuals rather than rely solely on recall.
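
For reference, the probabilities of necessity and sufficiency are usually defined in the causal inference literature (Pearl's formulation) as shown below. This is a sketch that assumes the paper adopts the standard binary-variable definitions, with $x, y$ denoting the cause and effect occurring and $x', y'$ denoting their absence; the notation $Y_{x}$ (the value $Y$ would take had $X$ been set to $x$) is an assumption and may differ from the paper's.

$$
\mathrm{PN} = P(Y_{x'} = y' \mid X = x,\, Y = y), \qquad \mathrm{PS} = P(Y_{x} = y \mid X = x',\, Y = y')
$$

Intuitively, PN asks whether the effect would have been absent had the cause been absent, given that both actually occurred; PS asks whether the effect would have occurred had the cause been present, given that both were actually absent.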

Abstract

Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systematically achieve better generalization than the base models in several problems that require, among others, inductive and deductive reasoning capabilities.

Paper Structure (52 sections, 11 equations, 6 figures, 6 tables, 3 algorithms)

Introduction
Contributions.
Fine-tuning for Reasoning
World Model.
Language Model.
Problem.
Modes of Generalization.
Metrics of Reasoning
Correctness
Why are factual and counterfactual correctness alone not enough?
Causal Consistency
Why are PN and PS correctness alone not enough?
An Illustrative Example.
Fine-tuning with Counterfactual Feedback
Supervised Counterfactual Feedback.
...and 37 more sections

Figures (6)

Figure 1: Error rate of Phi3-Mini in answering factual vs. counterfactual questions, sampling 10 answers for each $N\in\{1,\ldots,100\}$. It performs disproportionately better for the factual question (cf. recall) as opposed to the counterfactual question (cf. reasoning).
Figure 2: Different modes of generalization, in terms of the cause-effect relationships demonstrated during fine-tuning (i.e. $\mathcal{D}$, blue) vs. the relationship that the fine-tuned model is evaluated on (i.e. $\mathcal{P}_{X\to Y}$, orange).
Figure 3: Causal consistency vs. correctness. Despite having the same Avg-ER, different types of error distributions lead to widely different PN & PS characteristics.
Figure 4: Summary of the proposed fine-tuning methods. Supervised and preference-based counterfactual feedback (CF) target correctness: the former by generating correct answers given each question and the latter by sampling answers and preferring the correct ones over the others. Causal consistency feedback (CCF) targets causal consistency instead: asking both the factual and the counterfactual questions within the same dialogue allows us to elicit preferences according to relationships between the factual and counterfactual answers.
Figure 5: Hand-crafted puzzle with the original factual question, causal model with structural equations, and a counterfactual question. Blue and orange arrows show the cause-effect interventions demonstrated to the model during fine-tuning and evaluation phases.
...and 1 more figures
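
The preference-based counterfactual feedback described in the Figure 4 caption (sample answers, then prefer the correct ones over the others) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_preference_pairs`, the string-valued answers, and the exact-match correctness check are all assumptions made for the example.

```python
from itertools import product


def build_preference_pairs(sampled_answers, correct_answer):
    """Pair each correct sampled answer (chosen) with each incorrect one (rejected).

    Hypothetical helper: correctness is checked by exact match, which is an
    assumption for this sketch rather than the procedure used in the paper.
    """
    chosen = [a for a in sampled_answers if a == correct_answer]
    rejected = [a for a in sampled_answers if a != correct_answer]
    # Every (chosen, rejected) combination becomes one preference pair.
    return list(product(chosen, rejected))


# Five answers sampled for one counterfactual question whose correct answer is "yes".
samples = ["yes", "no", "yes", "no", "no"]
pairs = build_preference_pairs(samples, "yes")
print(len(pairs))
```

Pairs of this kind are the typical input to a preference-optimization objective; if every sampled answer is correct (or every one is incorrect), no pair is produced for that question.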
