Human Preferences for Constructive Interactions in Language Model Alignment

Yara Kyrychenko, Jon Roozenbeek, Brandon Davidson, Sander van der Linden, Ramit Debnath

TL;DR

This work addresses how to align LLMs toward constructive dialogue by analyzing a multicultural, individualized dataset of over 7,500 human–LLM conversations across 74 countries and 21 models. It uses mixed-effects Bayesian regressions to link linguistic attributes in human prompts and responses to preference scores and final model outputs, revealing that reasoning and nuance robustly drive higher human preferences while personal storytelling and curiosity often reduce them. The findings also show that toxicity can paradoxically boost top scores in some contexts and that user prompting can steer model behavior, with moderation revealing demographic and value-alignment factors that shape these preferences. Overall, the paper informs alignment strategies by highlighting when and how users can influence constructive engagement, while cautioning about risks of toxicity-driven feedback loops in personalized alignment.

Abstract

As large language models (LLMs) enter the mainstream, aligning them to foster constructive dialogue rather than exacerbate societal divisions is critical. Using an individualized and multicultural alignment dataset of over 7,500 conversations of individuals from 74 countries engaging with 21 LLMs, we examined how linguistic attributes linked to constructive interactions are reflected in human preference data used for training AI. We found that users consistently preferred well-reasoned and nuanced responses while rejecting those high in personal storytelling. However, users who believed that AI should reflect their values tended to place less preference on reasoning in LLM responses and more on curiosity. Encouragingly, we observed that users could set the tone for how constructive their conversation would be, as LLMs mirrored linguistic attributes, including toxicity, in user queries.

Paper Structure

This paper contains 4 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Posterior distributions of coefficients from the mixed-effects Bayesian logistic and linear regressions, with 95% credible intervals, for the highest score (a) and score (b) of LLM responses. (c) Coefficients of mixed-effects Bayesian linear regressions predicting LLM response attributes based on human query attributes (higher estimates indicate a higher likelihood of response exhibiting the attribute).
  • Figure 2: Posterior distributions of coefficients from the mixed-effects Bayesian logistic and linear regressions, with 95% credible intervals, for the highest score (panels a, b, and c) and score (panels d, e, and f) of LLM responses. Probabilities of bridging attributes (g) and toxicity (h) across three conversation categories.