Causal Direct Preference Optimization for Distributionally Robust Generative Recommendation

Chu Zhao, Enneng Yang, Jianzhe Zhao, Guibing Guo

Abstract

Direct Preference Optimization (DPO) guides large language models (LLMs) to generate recommendations aligned with users' historical behavior distributions by minimizing a preference alignment loss. However, our systematic empirical study and theoretical analysis reveal that DPO tends to amplify spurious correlations caused by environmental confounders during alignment, significantly undermining the generalization of LLM-based generative recommendation methods in out-of-distribution (OOD) scenarios. To mitigate this issue, we propose CausalDPO, an extension of DPO that incorporates a causal invariance learning mechanism. The method introduces a backdoor adjustment strategy during the preference alignment phase to eliminate interference from environmental confounders, explicitly models the latent environmental distribution using a soft clustering approach, and enforces consistency across diverse environments through invariance constraints. Theoretical analysis demonstrates that CausalDPO can effectively capture users' stable preference structures across multiple environments, thereby improving the OOD generalization performance of LLM-based recommendation models. We conduct extensive experiments under four representative distribution shift settings to validate the effectiveness of CausalDPO, achieving an average performance improvement of 17.17% across four evaluation metrics.
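
To make the objects named in the abstract concrete, here is a minimal sketch in plain Python. The function `dpo_loss` is the standard DPO loss for a single preference pair. The other two functions are illustrative assumptions, not the paper's formulation: `soft_env_risks` averages per-sample losses under hypothetical soft environment assignments, and `invariance_penalty` uses a simple variance-of-risks penalty as a stand-in for the invariance constraint, whose exact form is not given in this summary. All function and parameter names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair, where y_w is preferred over y_l.

    Inputs are sequence log-probabilities under the policy (logp_*)
    and under the frozen reference model (ref_logp_*).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def soft_env_risks(losses, env_probs):
    """Per-environment risk under soft assignments (assumed form).

    env_probs[i][k] is the assumed probability that sample i belongs
    to latent environment k; each row sums to 1.
    """
    n_env = len(env_probs[0])
    risks = []
    for k in range(n_env):
        weight = sum(p[k] for p in env_probs)
        total = sum(loss * p[k] for loss, p in zip(losses, env_probs))
        risks.append(total / weight if weight > 0 else 0.0)
    return risks


def invariance_penalty(risks):
    """Variance of risks across environments; zero when all risks match.

    This is a stand-in penalty chosen for illustration only.
    """
    mean = sum(risks) / len(risks)
    return sum((r - mean) ** 2 for r in risks) / len(risks)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and `dpo_loss` returns log 2. A training objective in this spirit would add a weighted `invariance_penalty` to the mean preference loss, so that a policy that does well in one latent environment and poorly in another is penalized.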

Paper Structure

This paper contains 32 sections, 5 theorems, 57 equations, 3 figures, 9 tables, 1 algorithm.

Key Result

Proposition 3.1

In the context of LLMs, when the environmental confounder $E$ in the training data satisfies the following preference bias: the environmental feature $E$ is more likely to appear in the preferred outputs $y_w$, DPO tends to learn spurious correlations between $E$ and the preferred output $y_w$. During the maximum likelihood optimization of the policy $\pi_\theta$ on preference pairs, such spuriou

Figures (3)

  • Figure 1: Left: This figure presents the number of interactions (frequency) for DPO-based models across item popularity groups, with popularity decreasing from G1 (head) to G5 (tail). Middle: This figure illustrates how LLM-based recommendation models can learn and amplify spurious correlations during preference alignment, and how the DPO mechanism further reinforces these spurious correlations. Right: Using a Structural Causal Model (SCM), this figure analyzes how environmental confounders $E$ affect the model and demonstrates how their influence can be mitigated via the backdoor adjustment criterion.
  • Figure 2: Further study of performance under distribution shifts, and clustering visualization.
  • Figure 3: Evaluating the impact of model hyperparameter sets on recommendation performance.
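
For reference, the backdoor adjustment mentioned in the Figure 1 caption has the following textbook form. This is the generic identity, not a formula quoted from the paper; it assumes that the environmental confounder $E$ satisfies the backdoor criterion for the pair $(x, y)$, where $x$ (model input) and $y$ (generated output) are symbol names introduced here for illustration.

$$
P\big(y \mid \mathrm{do}(x)\big) \;=\; \sum_{e} P\big(y \mid x, E = e\big)\, P\big(E = e\big)
$$

The sum weights each environment by its marginal probability $P(E = e)$ rather than by $P(E = e \mid x)$, which is what removes the dependence between the input and the confounder.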

Theorems & Definitions (6)

  • Proposition 3.1: DPO amplifies spurious correlations and hinders generalization capabilities.
  • Proposition 3.2: Invariant and sufficient preference policy via CausalDPO.
  • Proposition 3.3: Generalization Bound for CausalDPO.
  • Proposition 1.2: Invariant and sufficient preference policy via CausalDPO.
  • Proposition 1.3: Generalization Bound for CausalDPO.
  • proof