Test-Time Alignment via Hypothesis Reweighting

Yoonho Lee, Jonathan Williams, Henrik Marklund, Archit Sharma, Eric Mitchell, Anikait Singh, Chelsea Finn

TL;DR

This work addresses underspecification in large pretrained models by introducing HyRe, a test-time framework that rapidly aligns model behavior to target user intent through reweighting an efficient ensemble of heads trained on a shared backbone. At inference, HyRe uses a small labeled adaptation set $\text{D}_{\text{adapt}}$ to assign weights $w_k$ to ensemble members via $w_k = \frac{\exp(-\mathcal{L}(f_k,\text{D}_{\text{adapt}}))}{\sum_i \exp(-\mathcal{L}(f_i,\text{D}_{\text{adapt}}))}$ and forms $f_w(x)=\sum_k w_k f_k(x)$, which is interpreted through a generalized Bayesian lens with $\pi(w \mid \text{D}_{\text{adapt}}) \propto \exp(-\mathcal{L}(w,\text{D}_{\text{adapt}}))\, \pi(w)$. HyRe scales to large models with negligible per-instance overhead and requires only a handful of labels to surpass prior state-of-the-art reward models across 18 distributions. Across regression shifts, natural distribution shifts, and personalization tasks, HyRe enables rapid, on-the-fly alignment without full retraining, underscoring the value of test-time task specification via diverse ensembles and suggesting paths toward richer mixture-of-experts and active-label strategies.
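The reweighting rule in the TL;DR is a softmax over negative per-head losses on the adaptation set. The following is a minimal sketch of that rule, not the authors' implementation: the function names `hyre_weights` and `reweighted_predict` are hypothetical, and the loss $\mathcal{L}$ is assumed here to be summed squared error, which the summary does not specify.

```python
import numpy as np

def hyre_weights(head_preds, y_adapt):
    """Softmax weights over ensemble heads from adaptation-set losses.

    head_preds: array of shape (K, N), predictions of K heads on N
                labeled adaptation examples.
    y_adapt:    array of shape (N,), the adaptation labels.
    """
    # Assumed loss: summed squared error of each head on the adaptation set.
    losses = ((head_preds - y_adapt[None, :]) ** 2).sum(axis=1)
    # Subtracting the minimum loss leaves the softmax unchanged
    # but avoids underflow when losses are large.
    unnormalized = np.exp(-(losses - losses.min()))
    return unnormalized / unnormalized.sum()

def reweighted_predict(head_preds_test, w):
    """Form f_w(x) = sum_k w_k f_k(x) for test predictions of shape (K, M)."""
    return w @ head_preds_test

# Toy example: three heads, three adaptation points.
head_preds = np.array([
    [1.0, 2.0, 3.0],   # head 0 matches the labels exactly
    [1.1, 2.1, 2.9],   # head 1 is slightly off
    [3.0, 0.0, 1.0],   # head 2 disagrees strongly
])
y_adapt = np.array([1.0, 2.0, 3.0])
w = hyre_weights(head_preds, y_adapt)
f_w = reweighted_predict(head_preds, w)
```

In this toy case nearly all of the weight goes to heads 0 and 1, so the reweighted prediction tracks the labels far more closely than a uniform average of the three heads would.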

Abstract

Large pretrained models often struggle with underspecified tasks -- situations where the training data does not fully define the desired behavior. For example, chatbots must handle diverse and often conflicting user preferences, requiring adaptability to various user needs. We propose a novel framework to address the general challenge of aligning models to test-time user intent, which is rarely fully specified during training. Our approach involves training an efficient ensemble, i.e., a single neural network with multiple prediction heads, each representing a different function consistent with the training data. Our main contribution is HyRe, a simple adaptation technique that dynamically reweights ensemble members at test time using a small set of labeled examples from the target distribution, which can be labeled in advance or actively queried from a larger unlabeled pool. By leveraging recent advances in scalable ensemble training, our method scales to large pretrained models, with computational costs comparable to fine-tuning a single model. We empirically validate HyRe in several underspecified scenarios, including personalization tasks and settings with distribution shifts. Additionally, with just five preference pairs from each target distribution, the same ensemble adapted via HyRe outperforms the prior state-of-the-art 2B-parameter reward model accuracy across 18 evaluation distributions.

Paper Structure

This paper contains 26 sections, 9 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: The ensemble average is suboptimal in underspecified tasks. Performance of the uniform ensemble vs. the best individual model across four underspecified tasks (lower is better). In all cases, the best single head outperforms the uniform ensemble on the target distribution, highlighting the need for approaches that utilize additional information about the target distribution to optimize ensemble weighting.
  • Figure 1: Root mean squared error (RMSE) on test data with distribution shifts across three UCI datasets. We compare the performance of various ensemble architectures with test-time adaptation using HyRe. We find that HyRe consistently improves the performance of all model architectures.
  • Figure 2: Principal component analysis of an ensemble of regression models. (Left) The ensemble of functions, with each gray line representing one function. The dashed line shows the (average) ensemble prediction. (Right) The first three principal components of the ensemble's predictions. Each principal component reflects a distinct functional variation.
  • Figure 3: Performance of HyRe vs fine-tuning at different amounts of adaptation data. Ensemble reweighting outperforms fine-tuning in the low-data regime.
  • Figure 4: Visualization of an ensemble model trained on data with conflicting labels. (Left) The training dataset is labeled by multiple labelers with conflicting preferences, introducing ambiguity. (Center) The average predictions of an ensemble capture the "average labeler", resulting in smooth decision boundaries that blend the conflicting input. (Right) Increasing diversity leads to a population with higher maximum agreement with a held-out labeler.
  • ...and 5 more figures