Aligning LLMs with Domain Invariant Reward Models

David Wu, Sanjiban Choudhury

TL;DR

This work tackles the challenge of aligning LLMs in domains lacking human preference data by proposing DIAL, a dual-loss framework that learns domain-invariant reward models through Wasserstein-distance-based domain alignment and a source-domain preference objective. The approach trains a base LM with a domain critic and a reward head to separate domain-specific signals from domain-agnostic reward concepts, enabling transfer from labeled source data to unlabeled target data. Theoretical bounds connect target performance to source performance and domain discrepancy, and extensive experiments demonstrate DIAL's effectiveness across cross-lingual, clean-to-noisy, few-shot-to-full, and simple-to-complex transfers, including analyses of embeddings, data scaling, and RLHF-shift adaptation. The results suggest that domain-invariant reward models can significantly improve scalable alignment of LLMs in resource-poor domains, with practical implications for broad, low-cost deployment of RLHF-based systems.

Abstract

Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from simpler source domains, where human preferences are easier to obtain. Our key insight is that, while domains may differ significantly, human preferences convey domain-agnostic concepts that can be effectively captured by a reward model. We propose DIAL, a framework that trains domain-invariant reward models by optimizing a dual loss: a domain loss that minimizes the divergence between the source and target distributions, and a source loss that optimizes preferences on the source domain. We show that DIAL is a general approach by evaluating and analyzing it across 4 distinct settings: (1) Cross-lingual transfer (accuracy: $0.621 \rightarrow 0.661$), (2) Clean-to-noisy transfer (accuracy: $0.671 \rightarrow 0.703$), (3) Few-shot-to-full transfer (accuracy: $0.845 \rightarrow 0.920$), and (4) Simple-to-complex task transfer (correlation: $0.508 \rightarrow 0.556$). Our code, models, and data are available at https://github.com/portal-cornell/dial.

Paper Structure

This paper contains 41 sections, 3 theorems, 20 equations, 7 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let $r$ be a $K$-Lipschitz function. Then the target domain error $\epsilon_T(r, f)$ satisfies: where $W_1(\mu_S, \mu_T)$ is the Wasserstein-1 distance between the source and target distributions $\mu_S$ and $\mu_T$ over $(x, y)$, and $L_\sigma=\frac{1}{4}$ is the Lipschitz constant of $\sigma$.
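
The value $L_\sigma=\frac{1}{4}$ can be checked directly if $\sigma$ is taken to be the logistic sigmoid, which is an assumption here since this excerpt does not define $\sigma$. In that case $\sigma'(z) = \sigma(z)(1-\sigma(z))$, which peaks at $z = 0$, so no difference quotient of $\sigma$ can exceed $\frac{1}{4}$. The sketch below is only a numerical sanity check of that constant, not part of the paper; the helper names `sigmoid` and `sigmoid_slope` are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, assumed here to be the sigma of Theorem 1."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z: float) -> float:
    """Derivative sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


# Grid over [-10, 10] in steps of 0.01; it contains z = 0 exactly.
grid = [i / 100.0 for i in range(-1000, 1001)]

# Largest slope on the grid, attained at z = 0.
max_slope = max(sigmoid_slope(z) for z in grid)

# Largest difference quotient between neighbouring grid points.
# By the mean value theorem it cannot exceed the largest slope.
max_quotient = max(
    abs(sigmoid(a) - sigmoid(b)) / abs(a - b)
    for a, b in zip(grid, grid[1:])
)

print(f"max slope on grid:       {max_slope}")
print(f"max difference quotient: {max_quotient}")
```
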

Figures (7)

  • Figure 1: DIAL trains a domain-invariant reward model for target domains with no labeled preference data. DIAL leverages labeled source data and unlabeled target data to train reward models on a dual loss: a domain loss that minimizes the divergence between the source and target distributions, and a source loss that optimizes preferences on the source domain. We show that DIAL is a general approach by evaluating and analyzing it across 4 distinct applications: (1) Cross-lingual transfer, (2) Clean-to-noisy transfer, (3) Few-shot-to-full transfer, and (4) Simple-to-complex task transfer.
  • Figure 2: DIAL overview. DIAL takes labeled source data and unlabeled target data and trains a domain-invariant reward model. The model takes a prompt ($x$) and a response ($y$) and passes them through a base language model ($\theta$) with two heads: a domain critic head ($\psi$) and a reward head ($\phi$). The critic head is trained adversarially to minimize the Wasserstein distance between source and target embeddings, while the reward head optimizes preferences on source data.
  • Figure 3: Reward model embeddings learned by DIAL and Src-Pref across training iterations on (a) Cross-lingual Transfer and (b) Few-shot-to-full Transfer. Src-Pref separates source embeddings but not target embeddings, resulting in poor transfer. DIAL learns embeddings that cluster (source positive, target positive) and (source negative, target negative), allowing for better reward transfer.
  • Figure 4: Scaling with data on the legaladvice-Korean split. (a) DIAL performance with varying source-target data mix. (b) DIAL scaling with unlabeled target data vs. Src-Tgt-Pref scaling with labeled target data. Results over 3 seeds.
  • Figure 5: Spurious reward. Accuracy results on odd-one-out ($100$ datapoints) over 3 seeds. DIAL learns the correct reward, while Src-Pref learns a spurious reward of "not source", which performs similarly to random.
  • ...and 2 more figures
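
The dual loss described in Figures 1 and 2 can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are hand-written vectors standing in for base-LM outputs ($\theta$), both heads are plain linear maps, the source loss is assumed to be a Bradley-Terry pairwise loss, and the domain loss is the usual critic-based estimate of the Wasserstein-1 distance (mean critic score on source minus mean critic score on target). All function names, the `domain_weight` parameter, and the numbers are hypothetical.

```python
import math
from statistics import fmean
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product; stands in for a linear head applied to an embedding."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def source_preference_loss(reward_w: Vector,
                           chosen: List[Vector],
                           rejected: List[Vector]) -> float:
    """Assumed Bradley-Terry loss on labeled source pairs (reward head, phi)."""
    margins = [dot(reward_w, c) - dot(reward_w, r)
               for c, r in zip(chosen, rejected)]
    return -fmean([log_sigmoid(m) for m in margins])


def wasserstein_estimate(critic_w: Vector,
                         source: List[Vector],
                         target: List[Vector]) -> float:
    """Critic-based W1 estimate: E_source[critic] - E_target[critic] (head psi).

    The estimate is only meaningful when the critic is kept Lipschitz
    (for example via a gradient penalty), which this sketch does not enforce.
    """
    return (fmean([dot(critic_w, h) for h in source])
            - fmean([dot(critic_w, h) for h in target]))


def dual_loss(reward_w: Vector, critic_w: Vector,
              chosen: List[Vector], rejected: List[Vector],
              target: List[Vector],
              domain_weight: float = 1.0) -> Tuple[float, float, float]:
    """Return (total, source_loss, domain_loss) for one toy batch."""
    src = source_preference_loss(reward_w, chosen, rejected)
    dom = wasserstein_estimate(critic_w, list(chosen) + list(rejected), target)
    return src + domain_weight * dom, src, dom


# Toy embeddings: labeled source pairs and unlabeled target examples.
chosen = [[1.0, 0.0], [0.8, 0.2]]
rejected = [[0.0, 1.0], [0.2, 0.8]]
target = [[0.5, 0.5], [0.6, 0.4]]

reward_w = [1.0, -1.0]   # hypothetical reward-head weights
critic_w = [1.0, 0.0]    # hypothetical critic-head weights

total, src, dom = dual_loss(reward_w, critic_w, chosen, rejected, target)
print(f"source loss={src:.4f}  domain loss={dom:.4f}  total={total:.4f}")
```

In an adversarial setup of this kind, the critic would be updated to maximize the Wasserstein estimate, while the base model and reward head would be updated to minimize the combined loss; the sketch only evaluates the losses and performs no training.
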

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • proof
  • Theorem 2
  • proof