Optimizing Language Models for Human Preferences is a Causal Inference Problem

Victoria Lin; Eli Ben-Michael; Louis-Philippe Morency

TL;DR

This work reframes language model optimization for human preferences as a causal inference problem using direct outcome data, introducing the causal value $V(f)=\mathbb{E}_{X \sim P^f}[g(X)]$ and identifying it from randomized data via inverse probability weighting (IPW). It then presents causal preference optimization (CPO), which optimizes an unbiased surrogate objective, and its doubly robust extension (DR-CPO), which reduces estimator variance by integrating outcome modeling and remains unbiased under either known randomization or correct outcome modeling. Empirical results on the Hate Speech and Hong Kong datasets show that DR-CPO often yields higher expected rewards and robust performance under confounding, outperforming baselines such as OO-RLHF and standard fine-tuning. The paper demonstrates the practical viability of causal methods for aligning LLMs with human preferences using direct outcome data, and outlines future directions including entropy regularization and extensions to paired-data settings.
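The IPW idea summarized above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a tiny discrete set of texts, a known randomization distribution (`p_random`), and noiseless outcomes, and the helper name `ipw_value` and all numbers are hypothetical.

```python
def ipw_value(texts, outcomes, p_target, p_random):
    """Estimate V(f) = E_{X ~ P^f}[g(X)] from randomized (text, outcome) pairs.

    Each outcome is reweighted by P^f(x) / P^R(x), so that data collected
    under the randomization distribution stands in for data drawn from
    the target language model.
    """
    total = 0.0
    for x, y in zip(texts, outcomes):
        weight = p_target[x] / p_random[x]  # importance weight
        total += weight * y
    return total / len(texts)


# Hypothetical toy setup: three possible texts and their mean outcomes g(x).
p_random = {"a": 0.5, "b": 0.25, "c": 0.25}   # known randomization (assumed)
p_target = {"a": 0.2, "b": 0.3, "c": 0.5}     # model to evaluate (assumed)
g = {"a": 1.0, "b": 2.0, "c": 4.0}            # noiseless outcomes (assumed)

# A sample whose text frequencies match p_random exactly.
texts = ["a", "a", "b", "c"]
outcomes = [g[x] for x in texts]

true_value = sum(p_target[x] * g[x] for x in g)
estimate = ipw_value(texts, outcomes, p_target, p_random)
print(true_value, estimate)  # both are approximately 2.8
```

Because the toy sample matches `p_random` exactly and the outcomes carry no noise, the estimate coincides with the true value; with real data the two would agree only in expectation.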

Abstract

As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method, causal preference optimization (CPO), that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.
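To make the doubly robust idea concrete, here is a minimal sketch of a standard doubly robust value estimate on a toy discrete problem. It is an illustration under stated assumptions, not the DR-CPO objective itself: the function name `dr_value`, the outcome model `g_hat`, and all numbers are hypothetical.

```python
def dr_value(texts, outcomes, p_target, p_random, g_hat):
    """Doubly robust estimate of the expected outcome under p_target.

    direct:     plug-in value computed from the outcome model g_hat.
    correction: importance-weighted residuals, which cancel the error of
                g_hat when the randomization probabilities are correct.
    """
    direct = sum(p_target[x] * g_hat[x] for x in p_target)
    correction = sum(
        (p_target[x] / p_random[x]) * (y - g_hat[x])
        for x, y in zip(texts, outcomes)
    ) / len(texts)
    return direct + correction


# Hypothetical toy data: text frequencies match the randomization exactly.
p_random = {"a": 0.5, "b": 0.25, "c": 0.25}
p_target = {"a": 0.2, "b": 0.3, "c": 0.5}
g_true = {"a": 1.0, "b": 2.0, "c": 4.0}
texts = ["a", "a", "b", "c"]
outcomes = [g_true[x] for x in texts]

# A deliberately wrong outcome model: it predicts 1.5 for every text.
g_wrong = {"a": 1.5, "b": 1.5, "c": 1.5}

value_wrong_model = dr_value(texts, outcomes, p_target, p_random, g_wrong)
value_right_model = dr_value(texts, outcomes, p_target, p_random, g_true)
print(value_wrong_model, value_right_model)  # both are approximately 2.8
```

With correct randomization probabilities, the misspecified outcome model still recovers the true value of about 2.8 on this toy sample, which mirrors the "either known randomization or correct outcome modeling" guarantee. When the outcome model is accurate, the residuals shrink, which is the usual source of variance reduction.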

Paper Structure (28 sections, 5 theorems, 28 equations, 2 figures, 4 tables)

Introduction
Related Work
Language Model Optimization
Causal Inference and Doubly Robust Policy Learning
A Causal View of Language Model Optimization
(Doubly Robust) Causal Preference Optimization
Causal Preference Optimization
Doubly Robust Causal Preference Optimization
Relationship to Existing Approaches
Experiments
Datasets
Implementation
Results and Discussion
GPT-4 Annotation Validity
Outcome Optimization
...and 13 more sections

Key Result

Proposition 4.1

The value function $V(f)$ can be identified as
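For orientation, an identification result of this kind typically takes the standard IPW form sketched below. The symbols $P^R$ (the known distribution under which texts were randomized) and $Y$ (the observed outcome) are assumptions introduced here for illustration, and the sketch assumes a discrete text space with $P^R(x) > 0$ wherever $P^f(x) > 0$ and $\mathbb{E}[Y \mid X = x] = g(x)$ under randomization.

```latex
\begin{aligned}
\mathbb{E}_{X \sim P^R}\!\left[\frac{P^f(X)}{P^R(X)}\, Y\right]
  &= \sum_{x} P^R(x)\, \frac{P^f(x)}{P^R(x)}\, \mathbb{E}[Y \mid X = x] \\
  &= \sum_{x} P^f(x)\, g(x) \\
  &= \mathbb{E}_{X \sim P^f}[g(X)] = V(f).
\end{aligned}
```

The first step conditions on the text, the second cancels $P^R(x)$ and uses randomization to replace the conditional mean with $g(x)$, and the last step is the definition of the causal value.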

Figures (2)

Figure 1: CPO and DR-CPO win rates against OO-RLHF, FT, and one another. A win rate exceeding 0.5 indicates that the named method outperforms the competing method with respect to the target outcome. The error bars correspond to 95% confidence intervals, and asterisks (*) mean that the win rate difference between the two methods is statistically significant at the 95% confidence level.
Figure 2: Impact of outcome model confounding (measured in win rate difference divided by 2) on OO-RLHF, CPO, and DR-CPO. A negative impact indicates that confounding hurts the performance of the method. The error bars correspond to 95% confidence intervals, and asterisks (*) mean that the win rate difference is statistically significant at the 95% confidence level.
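
The win rates and 95% intervals in these captions can be read with a small worked example. This is a minimal sketch assuming a normal-approximation (Wald) interval for a binomial proportion; the captions do not say how the intervals or significance tests were computed, so the method, the helper name `win_rate_ci`, and the counts are all hypothetical.

```python
import math


def win_rate_ci(wins, comparisons, z=1.96):
    """Return (win_rate, lower, upper) using a Wald interval.

    z = 1.96 corresponds to a two-sided 95% confidence level under the
    normal approximation.
    """
    rate = wins / comparisons
    stderr = math.sqrt(rate * (1.0 - rate) / comparisons)
    return rate, rate - z * stderr, rate + z * stderr


def beats_chance(wins, comparisons):
    """True if the whole interval lies above 0.5 (one way to flag a win)."""
    _, lower, _ = win_rate_ci(wins, comparisons)
    return lower > 0.5


# Hypothetical counts: 60 wins versus 55 wins out of 100 comparisons.
rate_60, low_60, high_60 = win_rate_ci(60, 100)
rate_55, low_55, high_55 = win_rate_ci(55, 100)
print(rate_60, low_60, high_60, beats_chance(60, 100))
print(rate_55, low_55, high_55, beats_chance(55, 100))
```

In this sketch, 60 wins out of 100 gives an interval that sits just above 0.5, while 55 out of 100 gives one that straddles 0.5, so only the first would be flagged under this particular rule.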

Theorems & Definitions (5)

Proposition 4.1
Theorem 4.2
Proposition 4.3
Theorem 4.4
Proposition 4.5
