Test-Time Alignment via Hypothesis Reweighting

Yoonho Lee; Jonathan Williams; Henrik Marklund; Archit Sharma; Eric Mitchell; Anikait Singh; Chelsea Finn

TL;DR

This work addresses underspecification in large pretrained models by introducing HyRe, a test-time framework that rapidly aligns model behavior to target user intent by reweighting an efficient ensemble of heads trained on a shared backbone. At inference, HyRe uses a small labeled adaptation set $\text{D}_{\text{adapt}}$ to assign weights $w_k$ to ensemble members via $w_k = \frac{\exp(-\mathcal{L}(f_k,\text{D}_{\text{adapt}}))}{\sum_i \exp(-\mathcal{L}(f_i,\text{D}_{\text{adapt}}))}$ and forms $f_w(x)=\sum_k w_k f_k(x)$, which is interpreted through a generalized Bayesian lens with $\pi(w|\text{D}_{\text{adapt}}) \propto \exp(-\mathcal{L}(w,\text{D}_{\text{adapt}})) \pi(w)$. HyRe scales to large models with negligible per-instance overhead; with just five preference pairs per target distribution, it surpasses the prior state-of-the-art 2B-parameter reward model across 18 evaluation distributions. Across regression shifts, natural distribution shifts, and personalization tasks, HyRe enables rapid, on-the-fly alignment without full retraining, underscoring the value of test-time task specification via diverse ensembles and suggesting paths toward richer mixture-of-experts and active-label strategies.
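The reweighting rule above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the three toy "heads", the use of summed squared error as $\mathcal{L}$, and the helper names `hyre_weights` and `adapt_and_predict` are all assumptions made for illustration.

```python
import math

def hyre_weights(losses):
    """Return w_k proportional to exp(-L_k), i.e. a softmax over negative losses."""
    m = min(losses)  # shift by the smallest loss for numerical stability
    unnorm = [math.exp(-(loss - m)) for loss in losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def squared_error(pred, y):
    # Assumed per-example loss; the source does not fix a particular loss here.
    return (pred - y) ** 2

def adapt_and_predict(heads, adapt_set, x):
    """Weight each head by its loss on the adaptation set, then mix predictions."""
    # Loss of each head summed over the labeled adaptation examples.
    losses = [
        sum(squared_error(f(xi), yi) for xi, yi in adapt_set)
        for f in heads
    ]
    w = hyre_weights(losses)
    # Weighted ensemble prediction f_w(x) = sum_k w_k f_k(x).
    return w, sum(wk * f(x) for wk, f in zip(w, heads))

# Hypothetical ensemble: three heads that disagree on the slope.
heads = [lambda x: 2.0 * x, lambda x: -1.0 * x, lambda x: 0.5 * x]
# Small adaptation set drawn from a target where y = 2x.
adapt_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

weights, prediction = adapt_and_predict(heads, adapt_set, 4.0)
```

In this toy setup the first head fits the adaptation set exactly, so almost all of the weight concentrates on it and the mixed prediction at `x = 4.0` is very close to that head's output. No gradient step is involved; adaptation only needs one loss evaluation per head.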

Abstract

Large pretrained models often struggle with underspecified tasks -- situations where the training data does not fully define the desired behavior. For example, chatbots must handle diverse and often conflicting user preferences, requiring adaptability to various user needs. We propose a novel framework to address the general challenge of aligning models to test-time user intent, which is rarely fully specified during training. Our approach involves training an efficient ensemble, i.e., a single neural network with multiple prediction heads, each representing a different function consistent with the training data. Our main contribution is HyRe, a simple adaptation technique that dynamically reweights ensemble members at test time using a small set of labeled examples from the target distribution, which can be labeled in advance or actively queried from a larger unlabeled pool. By leveraging recent advances in scalable ensemble training, our method scales to large pretrained models, with computational costs comparable to fine-tuning a single model. We empirically validate HyRe in several underspecified scenarios, including personalization tasks and settings with distribution shifts. Additionally, with just five preference pairs from each target distribution, the same ensemble adapted via HyRe outperforms the prior state-of-the-art 2B-parameter reward model accuracy across 18 evaluation distributions.
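To make the "single neural network with multiple prediction heads" idea concrete, here is a small sketch, again in plain Python. The layer sizes, the `tanh` feature map, the random initialization, and the names `make_ensemble` and `forward` are hypothetical choices for illustration, not details taken from the paper.

```python
import math
import random

def make_ensemble(in_dim, hidden_dim, num_heads, seed=0):
    """Create one shared backbone matrix and one linear head per ensemble member."""
    rng = random.Random(seed)
    backbone = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    heads = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_heads)]
    return backbone, heads

def forward(backbone, heads, x):
    """Compute shared features once, then apply every head to them."""
    # Shared backbone: this cost is paid once per input, regardless of head count.
    features = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)))
        for row in backbone
    ]
    # Each head is a cheap linear readout on the same features.
    return [sum(w * h for w, h in zip(head, features)) for head in heads]

backbone, heads = make_ensemble(in_dim=4, hidden_dim=8, num_heads=5)
outputs = forward(backbone, heads, [0.1, -0.2, 0.3, 0.4])
```

Because the heads share the backbone, adding members costs only one extra readout each, which is one way to read the claim that the ensemble's cost is comparable to fine-tuning a single model. The per-head outputs are what a reweighting step such as HyRe would then combine.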
