Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment

Yunfan Zhang, Kathleen McKeown, Smaranda Muresan

TL;DR

The work addresses the problem of steering LLM outputs toward diverse human perspectives by evaluating chain-of-thought (CoT) based alignment methods. It formalizes steerable pluralism as selecting, from a candidate set $A$, the most representative response $a^{*} = \mathop{\mathrm{argmax}}\limits_{a \in A} p(a \mid d, s)$, and systematically compares prompting, human/synthetic CoT fine-tuning, and RLVR on the Value Kaleidoscope (VK) and OpinionQA datasets using Llama 3 8B and Qwen2.5-7B. RLVR consistently delivers the strongest performance and notable sample efficiency. The analysis of CoT faithfulness and safety also surfaces a trade-off with pluralistic reasoning. The study contributes a comprehensive, reproducible evaluation of CoT-based methods for steerable pluralism and highlights practical considerations for deploying multi-perspective alignment in real-world applications. These findings support the development of controllable, nuanced LLMs that can reflect diverse perspectives at manageable computational cost.
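The selection rule in the TL;DR can be illustrated with a minimal sketch. It assumes the model has already assigned a log-score to each candidate option, standing in for $p(a \mid d, s)$. The function names and the example scores are hypothetical and are not taken from the paper.

```python
import math


def normalize(option_logprobs: dict[str, float]) -> dict[str, float]:
    """Turn per-option log-scores into a probability distribution (softmax)."""
    if not option_logprobs:
        raise ValueError("need at least one candidate option")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(option_logprobs.values())
    weights = {k: math.exp(v - peak) for k, v in option_logprobs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def select_response(option_logprobs: dict[str, float]) -> str:
    """Return the candidate with the highest score, i.e. the argmax over A."""
    probs = normalize(option_logprobs)
    return max(probs, key=probs.get)


# Hypothetical log-scores for two answer options under one perspective.
example_scores = {"A": -0.4, "B": -1.6}
chosen = select_response(example_scores)
```

Because softmax is monotonic, normalizing does not change which option wins. It is included only so the scores can be read as probabilities.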

Abstract

Large Language Models (LLMs) are typically trained to reflect a relatively uniform set of values, which limits their applicability to tasks that require understanding of nuanced human perspectives. Recent research has underscored the importance of enabling LLMs to support steerable pluralism -- the capacity to adopt a specific perspective and align generated outputs with it. In this work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be applied to building steerable pluralistic models. We explore several methods, including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on synthetic explanations, and Reinforcement Learning with Verifiable Rewards (RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA datasets. Among the methods studied, RLVR consistently outperforms others and demonstrates strong training sample efficiency. We further analyze the generated CoT traces with respect to faithfulness and safety.
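To make the "verifiable" part of RLVR concrete, the sketch below shows one way a reward could be computed for a multiple-choice task: parse the final answer from a generated CoT and compare it with the gold option. This section does not specify the paper's reward function or output format, so the `Answer: X` convention, the regular expression, and the binary reward are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed output convention: the completion ends with a line like "Answer: B".
ANSWER_PATTERN = re.compile(r"Answer:\s*([A-Z])\b")


def extract_answer(completion: str) -> Optional[str]:
    """Return the last option letter stated in the completion, if any."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, gold_option: str) -> float:
    """Binary reward: 1.0 if the parsed answer equals the gold option, else 0.0."""
    return 1.0 if extract_answer(completion) == gold_option else 0.0


# Hypothetical completion containing reasoning followed by a final answer.
sample = "One side values autonomy, the other values safety. Answer: B"
reward = verifiable_reward(sample, "B")
```

A reward of this kind needs only the gold label, not a reference rationale, which fits the abstract's contrast between RLVR and fine-tuning on human-authored or synthetic CoT.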

Paper Structure

This paper contains 14 sections, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Example from OpinionQA used in the steerable pluralism task, with abbreviated outputs from RLVR-aligned models.
  • Figure 3: An example from the VK dataset, along with outputs from our Llama 3 8B RLVR model. Text spans in support of option A are highlighted in green, and text spans in support of option B are highlighted in red. Given two different perspectives, our model correctly predicted the most appropriate option. It is also worth noting that in both responses, the model considered viewpoints from both sides, demonstrating value pluralism in the CoT.
  • Figure 4: An example from the OpinionQA dataset, along with outputs from our Llama 3 8B RLVR model. Text spans in support of option A (liberal view) are highlighted in blue, and text spans in support of option B (conservative view) are highlighted in red. Again, our model correctly predicted the most appropriate option given the perspective while considering the opinions from both sides, thereby demonstrating value pluralism during the reasoning process.
  • Figure 5: Prompt Template for VK dataset.
  • Figure 6: Prompt Template for OpinionQA dataset.
  • ...and 4 more figures