PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie

TL;DR

PICACO tackles the Instruction Bottleneck in pluralistic in-context value alignment by introducing a training-free meta-instruction optimization via total correlation, formalized as $\text{TC}_{\bm e}(\bm V, \bm y \mid \bm x)$. The method alternates between enhancing aligned responses and refining the meta-instruction using a variational information objective, leveraging $q_{\bm req}$ and a redundancy metric $q_{meta}$ to promote multi-value conformity while avoiding superficial copying. Across five value compositions and multiple LLMs, PICACO outperforms baselines, showing robust steerability, resistance to jailbreak prompts, and adaptability to growing value sets without requiring model fine-tuning. The approach offers a scalable, model-agnostic path to pluralistic value alignment with practical implications for safer, more responsible AI systems.

Abstract

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs' understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
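
For reference, the abstract's central quantity can be read against the standard information-theoretic definition of total correlation, the multivariate generalization of mutual information. The formula below is that general definition for arbitrary random variables $X_1, \dots, X_n$ (generic symbols, not the paper's notation). It is not the paper's conditional objective between values and responses, whose exact form is not given in this section.

$$
\mathrm{TC}(X_1, \dots, X_n) \;=\; \sum_{i=1}^{n} H(X_i) \;-\; H(X_1, \dots, X_n) \;=\; D_{\mathrm{KL}}\!\left( p(x_1, \dots, x_n) \,\middle\|\, \prod_{i=1}^{n} p(x_i) \right)
$$

For $n = 2$ this reduces to the mutual information $I(X_1; X_2)$, and it is zero exactly when the variables are mutually independent.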

Paper Structure

This paper contains 80 sections, 23 equations, 13 figures, 17 tables, 1 algorithm.

Figures (13)

  • Figure 1: GPT-4o's responses when instructed to follow multiple helpful and harmless requirements (top) and Schwartz Basic Human Value dimensions (bottom). In both cases, some of the specified values are disregarded.
  • Figure 2: An illustration of PICACO. PICACO alternates between the following two steps: 1) in the Response Enhancement step, PICACO samples responses with the current best meta-instruction $\bm e^{t-1}$ and updates both the aligned response pool and the regularization response pool according to $q_{\bm \omega}$ and $q_{\bm \phi}$; 2) in the Instruction Refinement step, it searches for the meta-instruction that maximizes $\text{TC}_e$ given the current response pools, $\{\bm y_{i,j}^t\}_{j=1}^{M_1}$ and $\{\hat{\bm y}_{i,j}^t\}_{j=1}^{M_2}$.
  • Figure 3: (a) Negligible changes in overall conformity brought by using GPT-4o for meta-instruction sampling. (b) Overall conformity of Q, Q+IF, OPRO, and PICACO with the two LLMs in the main results table and O4-Mini on the two Schwartz value compositions. (c) Proportions of queries with average Helpfulness $\ge$ 3 or average Toxicity $\ge$ 3 when aligning GPT-3.5-Turbo to the Harmlessness composition under the jailbreak attack.
  • Figure 4: Conformity score statistics of Q+IF, Modular Pluralism, and PICACO across four numbers of HH values.
  • Figure 5: GPT-3.5-Turbo's continuation of "An unpopular opinion..." when aligned with two Schwartz values, Tradition and Hedonism, using PICACO, Q+IF, and Modular Pluralism.
  • ...and 8 more figures
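
The two-step alternation described in the Figure 2 caption can be sketched as a small loop. This is a toy illustration of the control flow only, not the authors' implementation. Every name in it (`toy_sample`, `value_hits`, `toy_objective`, `alternate`) is hypothetical, no LLM is called, and the scoring rules are simple stand-ins for $q_{\bm \omega}$, $q_{\bm \phi}$, and the $\text{TC}_e$ estimate. In particular, filling the regularization pool with the lowest-scoring samples and weighting it by 0.5 are assumptions made purely so the example runs.

```python
from typing import Dict, List, Sequence

VALUES = ("helpful", "harmless")  # toy value set, for illustration only
Pools = Dict[str, List[str]]      # query -> pooled responses


def value_hits(text: str) -> int:
    """Toy stand-in for a value-conformity scorer: count the values a text names."""
    return sum(v in text.lower() for v in VALUES)


def toy_sample(instruction: str, query: str, n: int) -> List[str]:
    """Toy stand-in for sampling n LLM responses (deterministic, no model call)."""
    named = [v for v in VALUES if v in instruction.lower()]
    responses = []
    for j in range(n):
        extra = VALUES[j % len(VALUES)]  # pretend the model sometimes adds a value
        words = sorted(set(named + [extra]))
        responses.append(f"{query} -> {' '.join(words)}")
    return responses


def shared_values(instruction: str, response: str) -> int:
    """Number of toy values that appear in both the instruction and the response."""
    return sum(v in instruction.lower() and v in response.lower() for v in VALUES)


def toy_objective(instruction: str, aligned: Pools, regular: Pools) -> float:
    """Hypothetical surrogate score; NOT the paper's TC_e estimator."""
    def mean_shared(pools: Pools) -> float:
        ys = [y for pool in pools.values() for y in pool]
        return sum(shared_values(instruction, y) for y in ys) / len(ys)
    # The 0.5 weight on the regularization pool is an arbitrary assumption.
    return mean_shared(aligned) - 0.5 * mean_shared(regular)


def alternate(queries: Sequence[str], candidates: Sequence[str],
              rounds: int = 2, n_samples: int = 4, m1: int = 2, m2: int = 2) -> str:
    """Alternate response enhancement and instruction refinement for a few rounds."""
    best = candidates[0]
    aligned: Pools = {x: [] for x in queries}
    regular: Pools = {x: [] for x in queries}
    for _ in range(rounds):
        # Step 1 (Response Enhancement): sample with the current best
        # meta-instruction and update both pools, keeping M1 and M2 items.
        for x in queries:
            ys = toy_sample(best, x, n_samples)
            aligned[x] = sorted(aligned[x] + ys, key=value_hits, reverse=True)[:m1]
            regular[x] = sorted(regular[x] + ys, key=value_hits)[:m2]
        # Step 2 (Instruction Refinement): pick the candidate meta-instruction
        # that maximizes the surrogate objective on the current pools.
        best = max(candidates, key=lambda e: toy_objective(e, aligned, regular))
    return best


candidates = ["Be helpful.", "Be harmless.", "Be helpful and harmless."]
result = alternate(["q1", "q2"], candidates)
```

In this toy setup the loop settles on the candidate that names both values, because only that instruction matches everything in the aligned pool. A real system would replace the fixed candidate list with sampled meta-instructions and the string matching with model-based scoring.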