Group Preference Optimization: Few-Shot Alignment of Large Language Models

Siyan Zhao, John Dang, Aditya Grover

TL;DR

Problem: efficiently aligning LLM outputs to diverse group preferences without extensive per-group data. Approach: Group Preference Optimization (GPO), a few-shot framework that augments a base LLM with a transformer predictor trained via in-context meta-learning on multiple groups, enabling adaptation to unseen groups from a few examples. Findings: GPO achieves higher alignment scores than prompting and fine-tuning baselines on OpinionQA and GlobalOpinionQA across Alpaca-7B and Llama2-13B, with better sample efficiency and lower compute than In-Context Finetune. Significance: enables scalable, personalized alignment of LLMs to a broad spectrum of groups, while highlighting ethical considerations and directions for extending to long-form output and broader group inclusion.

Abstract

Many applications of large language models (LLMs), ranging from chatbots to creative writing, require nuanced subjective judgments that can differ significantly across different groups. Existing alignment algorithms can be expensive to align for each group, requiring prohibitive amounts of group-specific preference data and computation for real-world use cases. We introduce Group Preference Optimization (GPO), an alignment framework that steers language models to preferences of individual groups in a few-shot manner. In GPO, we augment the base LLM with an independent transformer module trained to predict the preferences of a group for the LLM generations. For few-shot learning, we parameterize this module as an in-context autoregressive transformer and train it via meta-learning on several groups. We empirically validate the efficacy of GPO through rigorous evaluations using LLMs with varied sizes on three human opinion adaptation tasks. These tasks involve adapting to the preferences of US demographic groups, global countries, and individual users. Our results demonstrate that GPO not only aligns models more accurately but also requires fewer group-specific preferences, and less training and inference computing resources, outperforming existing strategies such as in-context steering and fine-tuning methods.

Paper Structure

This paper contains 35 sections, 5 equations, 15 figures, 5 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Overview of GPO. Left: We adopt a general definition of a group to refer to any collection of agents (e.g., demographic groups, individual personas). Each group has its own distinct preference toward a completion, which comprises a prompt and a response $(q, r)$, and each group exhibits a distribution of preferences over a range of completions. Group alignment aims to steer pretrained LLMs toward the preferences of a wide range of groups. For each group $g$, we represent its preference dataset as $\mathcal{D}_g = \{(x^g_1, y^g_1), \ldots, (x^g_n, y^g_n)\}$. Here, $y^g_i$ signifies the preference of group $g$ for a given pair of prompt $q^g_i$ and response $r^g_i$, while $x^g_i$ is the pair's LLM representation obtained with $\pi_{\text{emb}}(q^g_i, r^g_i)$. Right: After being trained through meta-learning, GPO provides a few-shot framework for aligning any base LLM to any unseen test group (e.g., group $e$) given a small amount of in-context preference data without fine-tuning, enabling inference-time personalization.
  • Figure 2: Illustration of the GPO architecture for a sequence of $n$ points, with $m$ context points and $n-m$ target points. The context $(x_{1:m}, y_{1:m})$ serves as few-shot conditioning for GPO. GPO processes the full sequence using a transformer and predicts the preference scores $\hat{y}_{m+1:n}$.
  • Figure 3: Alignment score comparisons on the OpinionQA dataset and GlobalOpinionQA dataset with Alpaca-7b and Llama2-13b-chat as base models. Results have been averaged across three group split setups and three random seeds, with standard deviations provided.
  • Figure 4: Qualitative comparison of GPO alignment with steered LMs, where each pie chart denotes the preference distribution of the group. Here, GPO uses Alpaca-7b's embedding.
  • Figure 5: Alignment score of various methods based on Llama2-13B with varying group context sample size. Evaluation conducted on survey questions for Nigeria from the GlobalOpinionQA dataset. The shaded region represents the standard deviation across three different seed results.
  • ...and 10 more figures
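
The context/target split described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_sequence`, the zero placeholder used to hide target preferences, the concatenation of each $x_i$ with its $y_i$ into one token, and the random stand-in embeddings are all hypothetical choices made here for clarity. The transformer that would consume these tokens is omitted.

```python
import numpy as np

def build_sequence(x, y, m, pad_value=0.0):
    """Prepare a length-n sequence with m context points and n - m target points.

    x: (n, d) array standing in for the LLM embeddings x_1..x_n.
    y: (n,) array of group preference scores y_1..y_n.
    m: number of context points whose preferences stay visible.

    Returns (tokens, is_target), where tokens has shape (n, d + 1) and
    is_target is a boolean mask marking positions m+1..n.
    """
    n = x.shape[0]
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < n")
    # Hide the preferences of the target points (assumption: zero placeholder).
    y_in = y.astype(float).copy()
    y_in[m:] = pad_value
    # Assumption: each token is the embedding concatenated with its preference.
    tokens = np.concatenate([x, y_in[:, None]], axis=1)
    is_target = np.arange(n) >= m
    return tokens, is_target

# Toy data: n = 6 points, d = 4 embedding dimensions, m = 2 context points.
rng = np.random.default_rng(0)
n, d, m = 6, 4, 2
x = rng.normal(size=(n, d))   # stand-in for pi_emb(q_i, r_i)
y = rng.uniform(size=n)       # stand-in for group preference scores
tokens, is_target = build_sequence(x, y, m)
```

In this sketch, a predictor would read `tokens` and be scored only on the positions where `is_target` is true, which correspond to $\hat{y}_{m+1:n}$ in the caption.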