COMMUNITY-CROSS-INSTRUCT: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities

Zihao He, Minh Duc Chu, Rebecca Dorn, Siyi Guo, Kristina Lerman

TL;DR

This work introduces Community-Cross-Instruct, an unsupervised framework for aligning LLMs to online communities to elicit their beliefs, and demonstrates the method’s utility in accurately representing political and diet communities on Reddit.

Abstract

Social scientists use surveys to probe the opinions and beliefs of populations, but these methods are slow, costly, and prone to biases. Recent advances in large language models (LLMs) enable the creation of computational representations or "digital twins" of populations that generate human-like responses mimicking the population's language, styles, and attitudes. We introduce Community-Cross-Instruct, an unsupervised framework for aligning LLMs to online communities to elicit their beliefs. Given a corpus of a community's online discussions, Community-Cross-Instruct uses an advanced LLM to automatically generate instruction-output pairs that serve to (1) finetune a foundational LLM to faithfully represent that community, and (2) evaluate the alignment of the finetuned model to the community. We demonstrate the method's utility in accurately representing political and diet communities on Reddit. Unlike prior methods requiring human-authored instructions, Community-Cross-Instruct generates instructions in a fully unsupervised manner, enhancing scalability and generalization across domains. This work enables cost-effective and automated surveying of diverse online communities.
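To make the evaluation step concrete, the following is a minimal sketch of how a finetuned model's multi-choice answers might be scored against the answers attributed to a community. It assumes alignment is measured as the fraction of survey questions on which the two agree; the abstract does not specify the metric. The names `SurveyItem` and `survey_agreement` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class SurveyItem:
    """One multi-choice survey question (hypothetical record layout)."""
    question: str
    options: dict[str, str]   # option letter -> option text
    community_answer: str     # letter attributed to the community


def survey_agreement(items: list[SurveyItem], model_answers: list[str]) -> float:
    """Fraction of items where the model picks the community's answer.

    Assumption: simple accuracy stands in for the alignment measure.
    """
    if len(items) != len(model_answers):
        raise ValueError("need exactly one model answer per survey item")
    if not items:
        return 0.0
    matches = sum(
        item.community_answer == answer
        for item, answer in zip(items, model_answers)
    )
    return matches / len(items)


# Toy usage with made-up questions and answers.
opts = {"A": "Agree", "B": "Disagree"}
toy_items = [
    SurveyItem("Question 1?", opts, "A"),
    SurveyItem("Question 2?", opts, "B"),
    SurveyItem("Question 3?", opts, "A"),
    SurveyItem("Question 4?", opts, "B"),
]
toy_score = survey_agreement(toy_items, ["A", "B", "B", "B"])
```

Here the model matches the community on three of the four toy questions, so `toy_score` is 0.75.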

Paper Structure (45 sections, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Illustration of Community-Cross-Instruct to align an LLM to a community. (1) Open-ended instructions and multi-choice survey questions are generated by an advanced LLM from the community data. (2) A foundational LLM is aligned to the community through instruction-tuning on the open-ended instructions. (3) The alignment of the finetuned LLM to the community is measured using the generated survey questions.
  • Figure 2: Example of (a) an instruction from CommInst and (b) a survey question from CommSurvey in the politics domain on the topic of marijuana. The open-ended instruction and survey question are paired with answers from different communities.
  • Figure 3: Overview of Community-Cross-Instruct, with an illustrative example of the politics domain. (1) Data is collected for each community within the desired domain. (2) BERTopic clusters the data and identifies prominent topics. A chunk is a set of documents from a community on the same topic. Chunk$_{i,j}$ represents the chunk from community $i$ on topic $j$. (3) For each topic, the advanced LLM is prompted with (i) on-topic chunks from each community and (ii) task definition of the instructional data generation (see Appendix \ref{app:prompt_temp_oe}), which leads the LLM to generate (a) open-ended instruction-response pairs and (b) multi-choice question-answer pairs. R$_{i,k}$ represents the response of community $i$ to instruction $k$; A$_{i,k}$ represents the answer of community $i$ to question $k$. (4) The open-ended instructions across all topics, along with the corresponding responses of community $i$, are added to CommInst$_i$, which is used to finetune a foundational LLM, to align the LLM to the community. (5) The multi-choice questions across all topics, along with the corresponding answers from community $i$, are added to CommSurvey$_i$, which is used to evaluate the finetuned LLM.
  • Figure 4: Prompting template to generate CommInst and CommSurvey in the politics domain.
  • Figure 5: Pairwise agreement between different communities, measured by Cohen's Kappa.
  • ...and 2 more figures
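
The pairwise agreement reported in Figure 5 uses Cohen's Kappa. As an illustration only, the sketch below computes kappa between every pair of communities from their answers to the same survey questions. The function names, community labels, and toy answers are hypothetical; the paper's own implementation and data are not shown here.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa between two equal-length sequences of categorical answers."""
    if len(a) != len(b) or not a:
        raise ValueError("answer lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of questions with identical answers.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each side's answer frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        # Both sides always give the same single answer; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def pairwise_kappa(answers: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Kappa for every unordered pair of communities, keyed by sorted name pair."""
    return {
        (c1, c2): cohens_kappa(answers[c1], answers[c2])
        for c1, c2 in combinations(sorted(answers), 2)
    }


# Toy answers (made up) from three hypothetical communities to four questions.
toy_answers = {
    "community_a": ["A", "A", "B", "B"],
    "community_b": ["A", "A", "B", "B"],
    "community_c": ["A", "B", "A", "B"],
}
toy_kappas = pairwise_kappa(toy_answers)
```

In this toy data, `community_a` and `community_b` answer identically (kappa of 1.0), while `community_c` agrees with each of them on only half the questions, which equals chance agreement and gives a kappa of 0.0.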