Improving and Assessing the Fidelity of Large Language Models Alignment to Online Communities

Minh Duc Chu, Zihao He, Rebecca Dorn, Kristina Lerman

TL;DR

The paper tackles the challenge of faithfully aligning large language models to online communities and rigorously assessing fidelity across multiple linguistic dimensions. It introduces an unsupervised, scalable pipeline that constructs instruction-response demonstrations from community data, finetunes LLMs (e.g., Llama-3) to mimic the target discourse, and generates synthetic corpora for evaluation along authenticity, emotional tone, toxicity, and harm. The authors validate the approach through a case study on dieting and body-image communities, showing that finetuned models replicate community language and harm profiles better than in-context baselines, and demonstrate potential for automated moderation and public-health insights by administering eating disorder (ED) screening instruments to the aligned models. The work highlights practical implications for social science research and platform safety while acknowledging limitations related to dataset bias, temporal shifts, artifacts from synthetic data, and ethical considerations surrounding harm assessment and diagnosis. Overall, it provides a scalable framework to construct high-fidelity digital representations of online communities and to leverage them for monitoring, research, and policy support in sensitive domains like eating disorders.

Abstract

Large language models (LLMs) have shown promise in representing individuals and communities, offering new ways to study complex social dynamics. However, effectively aligning LLMs with specific human groups and systematically assessing the fidelity of the alignment remain a challenge. This paper presents a robust framework for aligning LLMs with online communities via instruction-tuning and comprehensively evaluating alignment across various aspects of language, including authenticity, emotional tone, toxicity, and harm. We demonstrate the utility of our approach by applying it to online communities centered on dieting and body image. We administer an eating disorder psychometric test to the aligned LLMs to reveal unhealthy beliefs and successfully differentiate communities with varying levels of eating disorder risk. Our results highlight the potential of LLMs in automated moderation and broader applications in public health and social science research.

Paper Structure (53 sections, 10 figures, 9 tables)

This paper contains 53 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: An example of a demonstration from a pro-eating disorder community, where the response is a tweet from the community.
  • Figure 2: The framework of our method. (1) We align an LLM (Llama-3) to an online community by finetuning the LLM to follow instructions on the task of generating tweets written by users in the community. (2) To prove the effectiveness of alignment, we compare three tweet corpora for each community: human-written tweets $D_i$, LLM-Context-generated tweets $D^{context}_i$, and finetuned LLM-generated tweets $D^{ft}_i$. We show that $D^{ft}_i$ is closer to $D_i$ than $D^{context}_i$ is, along the following aspects: (a) A classifier trained to classify the tweet origin (which community the tweet belongs to) on $\mathbb{D}=\{D_i\}_{i=1}^{n}$ performs better on $\mathbb{D}^{ft}=\{D^{ft}_i\}_{i=1}^{n}$ than on $\mathbb{D}^{context}=\{D^{context}_i\}_{i=1}^{n}$; (b) the emotion and toxicity distributions of $D^{ft}_i$ are much closer to those of $D_i$ than those of $D^{context}_i$ are; (c) the semantic embeddings of $D^{ft}_i$ are closer to those of $D_i$ in the embedding space than those of $D^{context}_i$ are; (d) a human annotator decides that $D^{ft}_i$ is more aligned to the underlying distribution of $D_i$ than $D^{context}_i$ is; (e) two ED experts determine that $D^{ft}_i$ carries harmful narratives that are more similar to those of $D_i$ than the narratives of $D^{context}_i$ are. (3) As the LLM is aligned with the community and can speak in the voice of that community, we administer an ED questionnaire to screen the community for EDs.
  • Figure 3: Emotional agreement (a) between human-written tweets and LLM-Context-generated tweets, and (b) between human-written tweets and finetuned LLM-generated tweets. The differences in the emotional alignment between pairs within each community are statistically significant at a 95% confidence level.
  • Figure 4: Toxicity distributions across different communities of human-written tweets, LLM-Context-generated tweets, and finetuned LLM-generated tweets.
  • Figure 5: Distribution of the three fine-grained harm categories.
  • ...and 5 more figures
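
The comparison described in Figure 2 (b) can be illustrated with a minimal sketch. It checks whether the toxicity distribution of a finetuned corpus sits closer to the human-written corpus than the in-context corpus does. Everything here is an assumption made for illustration: this summary does not state which distance measure the paper uses, so Jensen-Shannon divergence is swapped in, and the scores, the helper names `histogram` and `js_divergence`, and the bin count are all hypothetical.

```python
import math
from collections import Counter


def histogram(scores, bins=5):
    """Bin scores in [0, 1] into a normalized histogram with add-one smoothing."""
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    total = len(scores) + bins  # add-one smoothing adds one pseudo-count per bin
    return [(counts.get(b, 0) + 1) / total for b in range(bins)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up toxicity scores for one community (NOT data from the paper).
human = [0.62, 0.71, 0.55, 0.80, 0.47, 0.66]        # stands in for D_i
finetuned = [0.58, 0.69, 0.52, 0.77, 0.50, 0.61]    # stands in for D^ft_i
in_context = [0.10, 0.15, 0.08, 0.22, 0.12, 0.18]   # stands in for D^context_i

p_human = histogram(human)
jsd_ft = js_divergence(p_human, histogram(finetuned))
jsd_context = js_divergence(p_human, histogram(in_context))

print(f"JSD(human, finetuned)  = {jsd_ft:.3f}")
print(f"JSD(human, in-context) = {jsd_context:.3f}")
print("finetuned corpus is closer:", jsd_ft < jsd_context)
```

A lower divergence means the generated corpus tracks the human-written distribution more closely. The same pattern would apply to emotion label frequencies by replacing the binned scores with per-label proportions.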