Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes

Katarzyna Kobalczyk; Claudio Fanconi; Hao Sun; Mihaela van der Schaar

TL;DR

The paper tackles heterogeneity in user preferences when aligning LLMs by using Neural Processes to extend the Bradley-Terry-Luce framework so that it accommodates unobserved variability. It introduces NP-BTL for stochastic reward modelling and NP-DPO for context-conditioned direct policy optimisation, enabling inference-time adaptation to individual user preferences from few-shot data. Empirical results on synthetic tasks and the UltraFeedback HH dataset demonstrate data-efficient capture of diverse behaviours and the ability to generate outputs across a continuum of behavioural modes. By avoiding the need for multiple objective-specific datasets or models, the approach advances steerable pluralistic alignment, with practical implications for personalised AI agents. The code is publicly released.

Abstract

As large language models (LLMs) become increasingly embedded in everyday applications, ensuring their alignment with the diverse preferences of individual users has become a critical challenge. Currently deployed approaches typically assume homogeneous user objectives and rely on single-objective fine-tuning. However, human preferences are inherently heterogeneous, influenced by various unobservable factors, leading to conflicting signals in preference data. Existing solutions addressing this diversity often require costly datasets labelled for specific objectives and involve training multiple reward models or LLM policies, which is computationally expensive and impractical. In this work, we present a novel framework for few-shot steerable alignment, where users' underlying preferences are inferred from a small sample of their choices. To achieve this, we extend the Bradley-Terry-Luce model to handle heterogeneous preferences with unobserved variability factors and propose its practical implementation for reward modelling and LLM fine-tuning. Thanks to our proposed approach of functional parameter-space conditioning, LLMs trained with our framework can be adapted to individual preferences at inference time, generating outputs over a continuum of behavioural modes. We empirically validate the effectiveness of our methods, demonstrating their ability to capture and align with diverse human preferences in a data-efficient manner. Our code is made available at: https://github.com/kasia-kobalczyk/few-shot-steerable-alignment.
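
To make the idea of a user-dependent Bradley-Terry-Luce model concrete, the following is a minimal illustrative sketch, not the paper's NP-BTL implementation. It assumes a hypothetical linear reward in which each response is described by two made-up feature scores (helpfulness, honesty) and each user by a latent weight vector. In the framework described above, the user-specific latent would instead be inferred from a few observed choices; here it is simply fixed by hand. All names and numbers are assumptions for illustration.

```python
import math


def btl_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry-Luce probability that response A is preferred over B.

    Equal to the logistic sigmoid of the reward difference.
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def conditional_reward(features, user_latent):
    """Hypothetical user-conditioned reward (assumption: a plain dot product)."""
    return sum(f * z for f, z in zip(features, user_latent))


# Made-up (helpfulness, honesty) scores for two candidate responses.
response_a = [0.9, 0.2]
response_b = [0.3, 0.8]

# Hand-picked latent vectors for two hypothetical users.
user_helpful = [1.0, 0.0]  # cares only about helpfulness
user_honest = [0.0, 1.0]   # cares only about honesty

# The same pair of responses yields different preference probabilities
# depending on which user's latent vector conditions the reward.
p_helpful = btl_preference_prob(
    conditional_reward(response_a, user_helpful),
    conditional_reward(response_b, user_helpful),
)
p_honest = btl_preference_prob(
    conditional_reward(response_a, user_honest),
    conditional_reward(response_b, user_honest),
)

print(f"P(A preferred | helpfulness-focused user) = {p_helpful:.3f}")
print(f"P(A preferred | honesty-focused user)     = {p_honest:.3f}")
```

Under these assumed numbers the two users disagree on the same pair, which is the kind of conflicting signal a single shared reward model cannot represent.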
