Abductive Preference Learning

Yijin Ni, Peng Qi

TL;DR

The paper tackles persistent overconfidence in large language models by identifying a bias in standard preference learning: models underutilize counterfactual prompts. It proposes abductive preference learning, which reverses conditioning to optimize the prompt given a response, formalized via an abductive policy and loss (A-DPO) and extended to multitask variants (Multi-DPO/Multi-DPOP). Empirical validation on HaluEval, AlpacaEval, and HumorDB shows abductive methods enhance prompt discrimination without harming response alignment, with multitask training yielding strong gains in both directions (e.g., up to $99.5\%$ response accuracy and $85.0\%$ abductive accuracy on HaluEval; $87.0\%$ humor-discrimination on HumorDB). The results suggest abductive preference learning is a general, complementary fine-tuning paradigm for improving sensitivity to counterfactual inputs across text and multimodal tasks, preserving benefits of conventional preference optimization while mitigating counterfactual blind spots.

Abstract

Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). For instance, they tend to give the same conservative answer "No" to both versions of the question "Can I eat the [food / potato chips] that has been left out overnight?", even though potato chips require no refrigeration for safe consumption. We find that this failure may be attributable to a limitation of existing preference learning: it emphasizes selecting the correct response for a given prompt, while neglecting counterfactual prompts that should alter the response. To address this limitation, we propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate this idea, we construct an abductive dataset with 1,001 entries derived from the HaluEval QA benchmark, and implement abductive DPO and its variant DPOP. Experiments reveal complementary strengths: standard methods improve response selection, abductive methods improve prompt discrimination, and a multitask objective unifies both. On the abductive dataset, multitask DPOP boosts accuracy from $90.0\%$ to $99.5\%$ in response selection and from $54.7\%$ to $85.0\%$ in prompt discrimination, with qualitative evidence highlighting improved sensitivity to prompt differences. Finally, evaluation on AlpacaEval shows that multitask DPOP improves win rate (from $5.26\%$ to $6.17\%$), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the overlooked challenge of counterfactual prompts.
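
To make the reversed conditioning concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the abductive loss has the same logistic form as standard DPO, with the preferred and dispreferred items being two prompts scored against one fixed response instead of two responses scored against one fixed prompt. The function names, the `beta` default, and the way `lam` combines the two objectives are all hypothetical choices for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred
    response given the SAME prompt; ref_* are the reference-model values.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def abductive_dpo_loss(logp_y_xw, logp_y_xl, ref_logp_y_xw, ref_logp_y_xl, beta=0.1):
    """Abductive analogue (assumed form, not taken from the paper).

    The response y is fixed; x_w is the prompt that should elicit y and
    x_l is a counterfactual prompt that should not. Inputs are
    log pi(y | x_w) and log pi(y | x_l) under the policy and the reference.
    """
    margin = (logp_y_xw - ref_logp_y_xw) - (logp_y_xl - ref_logp_y_xl)
    return -log_sigmoid(beta * margin)


def multitask_loss(standard, abductive, lam=1.0):
    """Hypothetical combination: abductive term plus lam times the standard term."""
    return abductive + lam * standard


# Hypothetical log-probabilities for a single training example.
std = dpo_loss(-1.0, -2.0, -1.5, -1.5)
abd = abductive_dpo_loss(-1.2, -1.3, -1.2, -1.2)
total = multitask_loss(std, abd, lam=0.5)
```

In this sketch the only difference between the two losses is which side of the conditioning bar varies across the pair. The abductive term therefore penalizes a model that assigns the same response equally high probability under both the original and the counterfactual prompt.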

Paper Structure

This paper contains 16 sections, 1 theorem, 12 equations, 5 figures, 9 tables.

Key Result

Proposition 2.1

Suppose the marginal distribution of prompts, $p(x)$, is independent of the model policies ($\pi_{\text{ref}}$ and $\pi_\theta$). Let $\widetilde{\pi}$ denote the abductive policy induced by $\pi$. Then the A-DPO loss can be expressed as

Figures (5)

  • Figure 1: Abductive preference learning is a general fine-tuning paradigm obtained by switching the roles of prompts and responses. The shaded box illustrates how this principle applies broadly across existing preference learning methods. Abductive DPO and A-DPOP are shown as examples.
  • Figure 2: Example image pair in HumorDB. Left: image rated as funny ($83.3\%$ of participants). Right: modified image rated as not funny ($85.7\%$ of participants). Note the phone in the surgeon's hand in the left image.
  • Figure 3: Ablation studies for the weight $\lambda$ of the original preference learning objective. Dashed lines indicate base model performance.
  • Figure 4: Training log-probabilities vs. epoch. The blue line denotes DPO training, while the red line denotes A-DPO.
  • Figure 5: Effect of threshold $\delta$ on generalization. We use $(\delta_t, \delta_e)$ to denote the performance of a model trained on the dataset generated with $\delta_t$ and evaluated on the dataset generated with $\delta_e$. Left: abductive accuracy on A-HaluEval (red). Right: standard accuracy on HaluEval (blue). Each curve reports performance across 4 epochs; for the model trained with $\delta=1.0$, checkpoints are averaged into two per epoch.

Theorems & Definitions (1)

  • Proposition 2.1
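
The assumption in Proposition 2.1 that $p(x)$ does not depend on the policy can be illustrated with a small numerical check. The sketch below assumes, as one natural reading that the text above does not confirm, that the abductive policy is obtained from $\pi$ by Bayes' rule, $\widetilde{\pi}(x \mid y) \propto \pi(y \mid x)\,p(x)$. Under that assumption, the DPO-style margin between two prompts computed from abductive policies equals the margin computed from forward likelihoods $\pi(y \mid x)$, because the normalizer and the shared prior cancel. All numbers and names are hypothetical, and this is not presented as the expression stated in the proposition.

```python
import math


def abductive_log_policy(log_lik, log_prior):
    """Return log pi~(x_i | y) for each candidate prompt x_i via Bayes' rule.

    log_lik[i]   = log pi(y | x_i) for one fixed response y
    log_prior[i] = log p(x_i), assumed shared by policy and reference
    """
    joint = [a + b for a, b in zip(log_lik, log_prior)]
    m = max(joint)
    log_norm = m + math.log(sum(math.exp(j - m) for j in joint))
    return [j - log_norm for j in joint]


# Hypothetical values: three candidate prompts, one fixed response.
log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]
theta_log_lik = [-1.0, -2.5, -3.0]  # log pi_theta(y | x_i)
ref_log_lik = [-1.8, -2.0, -2.2]    # log pi_ref(y | x_i)

theta_abd = abductive_log_policy(theta_log_lik, log_prior)
ref_abd = abductive_log_policy(ref_log_lik, log_prior)

w, l = 0, 1  # indices of the preferred and dispreferred prompt

# Margin written with abductive policies pi~(x | y).
abd_margin = (theta_abd[w] - ref_abd[w]) - (theta_abd[l] - ref_abd[l])

# Same margin written with forward likelihoods pi(y | x).
fwd_margin = (theta_log_lik[w] - ref_log_lik[w]) - (theta_log_lik[l] - ref_log_lik[l])

print(abs(abd_margin - fwd_margin) < 1e-9)
```

The practical consequence, if this reading holds, is that training never needs to evaluate the prompt marginal or sum over candidate prompts. Ordinary forward log-probabilities of the response under each prompt are enough.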