Table of Contents
Fetching ...

Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

Qishuai Zhong, Zongmin Li, Siqi Fan, Aixin Sun

TL;DR

This work addresses how LLMs adapt outputs to users' sociodemographic contexts when attributes are provided explicitly in a prompt or implicitly via dialogue history. It introduces a two-format evaluation framework and an agent-based synthetic dataset aligned with profiles, using Hofstede’s Value Survey Module (VSM 2013) to probe value expression, quantified with $JSD$ across demographic groups and $EMD$ for cross-format consistency. The study evaluates multiple open-source LLMs, including reasoning-augmented models, finding that most models adjust expressed values with demographic changes—especially age and education—with larger, reasoning-enabled models showing stronger cross-format consistency, notably the QwQ-32B model. The results underscore the importance of reasoning capabilities in achieving robust sociodemographic adaptation and provide a privacy-preserving benchmark by releasing the synthetic dataset for future research. Overall, the framework offers a rigorous, controllable approach to assess cross-format adaptation relevant for real-world chatbot deployments.

Abstract

Effective engagement by large language models (LLMs) requires adapting responses to users' sociodemographic characteristics, such as age, occupation, and education level. While many real-world applications leverage dialogue history for contextualization, existing evaluations of LLMs' behavioral adaptation often focus on single-turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi-turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi-agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.

Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

TL;DR

This work addresses how LLMs adapt outputs to users' sociodemographic contexts when attributes are provided explicitly in a prompt or implicitly via dialogue history. It introduces a two-format evaluation framework and an agent-based synthetic dataset aligned with profiles, using Hofstede’s Value Survey Module (VSM 2013) to probe value expression, quantified with across demographic groups and for cross-format consistency. The study evaluates multiple open-source LLMs, including reasoning-augmented models, finding that most models adjust expressed values with demographic changes—especially age and education—with larger, reasoning-enabled models showing stronger cross-format consistency, notably the QwQ-32B model. The results underscore the importance of reasoning capabilities in achieving robust sociodemographic adaptation and provide a privacy-preserving benchmark by releasing the synthetic dataset for future research. Overall, the framework offers a rigorous, controllable approach to assess cross-format adaptation relevant for real-world chatbot deployments.

Abstract

Effective engagement by large language models (LLMs) requires adapting responses to users' sociodemographic characteristics, such as age, occupation, and education level. While many real-world applications leverage dialogue history for contextualization, existing evaluations of LLMs' behavioral adaptation often focus on single-turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi-turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi-agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.

Paper Structure

This paper contains 23 sections, 4 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: We evaluate whether the model can adjust response values according to identical user attributes presented in different formats, and assess the consistency across these formats.
  • Figure 2: Dataset generation framework architecture. Each iteration: (i) user_simulator LLM is queried to generate a question simulating the user's perspective based on their profile, (ii) out-of-context detector validates the question to ensure consistency with the user's profile, and (iii) qa_llm responds to the question.
  • Figure 3: Model querying workflow with key components. Here, $u$ denotes a user profile, and $d$ is a synthetic dialogue. Each line $r\in R$ represents the response to a VSM question, which includes a normalized probability distribution $P$ over the 5 option_ids.
  • Figure 4: Mean probability of the “selected_option_id” in BA_user and BA_dialogue, reflecting model confidence. Most models show similar decisiveness across both scenarios.
  • Figure 5: The measurement results for BA_user and BA_dialogue are shown below. The first row compares groups by "Age," while the second row presents results for "Education Level." Most models exhibit a positive correlation between computed distances and demographic differences.
  • ...and 11 more figures