CrowdLLM: Building LLM-Based Digital Populations Augmented with Generative Models

Ryan Feng Lin, Keyu Tian, Hanming Zheng, Congjing Zhang, Li Zeng, Shuai Huang

TL;DR

CrowdLLM tackles the challenge of creating realistic digital populations by coupling a frozen LLM with a lightweight generative belief model to inject human-like diversity. The framework covers virtual participant recruitment, reference and belief generation, personalized decision-making, and crowd-level aggregation, all trained with a compact objective that requires limited real data. Theoretical analysis shows that a target population can be approximated and that diversity in the digital population reduces risk under reasonable conditions, with performance dependent on LLM backbone quality and belief variance. Empirical results across crowdsourcing, product ratings, and voting demonstrate superior distributional fidelity and competitive accuracy relative to strong baselines, and simulations reveal favorable trade-offs between diversity, data quantity, and cost. Overall, CrowdLLM provides a scalable, data-efficient approach to synthesizing human-grade digital populations for various decision-making tasks.
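
As a rough illustration of the four stages named above (recruitment, reference and belief generation, personalized decision-making, aggregation), the following sketch simulates a digital population in plain Python. Everything in it is hypothetical: the function names, the fixed `reference_decision` standing in for the frozen LLM's output, the Gaussian noise standing in for the learned belief model, and the additive blend are assumptions made for illustration, not the paper's implementation.

```python
import random
import statistics

def sample_profile(rng, dim=3):
    """Hypothetical profile generator: a bounded profile vector in (0, 1)^dim."""
    return [rng.random() for _ in range(dim)]

def belief_bias(profile, rng, scale=1.0):
    """Hypothetical belief generator: a profile-dependent shift plus noise."""
    shift = sum(profile) / len(profile) - 0.5  # centered, lies in [-0.5, 0.5]
    return scale * shift + rng.gauss(0.0, 0.1 * scale)

def simulate_crowd(reference_decision, n_participants, seed=0, scale=1.0):
    """Blend one shared reference decision with each participant's belief bias."""
    rng = random.Random(seed)
    decisions = []
    for _ in range(n_participants):
        profile = sample_profile(rng)  # virtual participant recruitment
        decisions.append(reference_decision + belief_bias(profile, rng, scale))
    return decisions

def aggregate(decisions):
    """Crowd-level aggregation: mean response and its spread."""
    return statistics.fmean(decisions), statistics.pstdev(decisions)

# Assumed reference decision of 3.5 (e.g. a rating); not a value from the paper.
decisions = simulate_crowd(reference_decision=3.5, n_participants=500)
crowd_mean, crowd_spread = aggregate(decisions)

# With the belief term switched off, every simulated participant repeats the
# reference decision, so the crowd has no diversity at all.
flat_mean, flat_spread = aggregate(simulate_crowd(3.5, 10, scale=0.0))
```

In this toy setup the `scale` parameter plays the role of belief variance: at zero the population collapses onto the LLM's single answer, which is the failure mode the belief model is meant to avoid.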

Abstract

The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations for applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human-subject studies. However, research has found that most existing works rely solely on LLMs and cannot sufficiently capture the accuracy and diversity of a real human population. To address this limitation, we propose CrowdLLM, which integrates pretrained LLMs and generative models to enhance the diversity and fidelity of the digital population. We conduct a theoretical analysis of CrowdLLM's potential for creating cost-effective, sufficiently representative, scalable digital populations that can match the quality of a real crowd. Comprehensive experiments across multiple domains (e.g., crowdsourcing, voting, user rating), together with simulation studies, demonstrate that CrowdLLM achieves promising performance in both accuracy and distributional fidelity to human data.

Paper Structure

This paper contains 41 sections, 7 theorems, 51 equations, 26 figures, 5 tables.

Key Result

Theorem 1

Suppose the profiles are $d$-dimensional bounded vectors following a target mixed-type distribution $\mathcal{T}$. Consider $\rho$ as an easy-to-sample distribution taken to be uniform on $(0,1)^{d+1}$. For any $\varepsilon\in(0,1)$, there exists a profile generator $G$ building on a generative mode ... Here, $W_1$ is the Wasserstein-1 distance, $G_{\sharp}\rho$ is the pushforward of $\rho$ by ...
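
To make the Wasserstein-1 criterion in Theorem 1 concrete, here is a minimal sketch for the one-dimensional case, where the $W_1$ distance between two empirical distributions with the same number of samples equals the mean absolute difference of their sorted values. The generator `G`, the interval $[18, 80]$, the target sampler, and the sample size are hypothetical stand-ins chosen for illustration. The theorem itself concerns $d$-dimensional mixed-type profiles and a uniform base distribution on $(0,1)^{d+1}$.

```python
import random

def empirical_w1(xs, ys):
    """1-D Wasserstein-1 distance between two equal-size empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut needs samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

def G(z):
    """Hypothetical generator: maps a uniform(0, 1) draw to a value in [18, 80]."""
    return 18.0 + 62.0 * z

rng = random.Random(0)
n = 2000

# Samples from the pushforward of the uniform base distribution through G.
pushforward = [G(rng.random()) for _ in range(n)]

# A target that G matches by construction, and one that it does not.
target = [rng.uniform(18.0, 80.0) for _ in range(n)]
mismatched = [rng.uniform(30.0, 50.0) for _ in range(n)]

w1_good = empirical_w1(pushforward, target)      # small: sampling noise only
w1_bad = empirical_w1(pushforward, mismatched)   # large: wrong distribution
```

A small $W_1$ value indicates that the generated profiles are distributed like the target profiles. The comparison between `w1_good` and `w1_bad` shows how the distance separates a well-matched generator from a poorly matched one.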

Figures (26)

  • Figure 1: A comparison of different decision-making workflows. (a) LLM: Decisions are purely made by LLM through the input of prompts. (b) Real population: Diverse decisions are made by a population of humans with diverse profiles. (c) CrowdLLM: Diverse decisions are made by simulated humans. Each simulated human's decision is a blend of a reference decision generated by a pretrained LLM and the personal belief bias generated by a belief generator. The simulated humans are sampled probabilistically by a profile generator.
  • Figure 2: An illustration of the risk decomposition for a specific problem $\boldsymbol{x}$. The yellow circle represents the sample human population $U$ while the purple dashed circle represents their digital counterpart. The balls in the circles represent a physical human individual $u_i$ and their digital counterpart $\tilde{u}_i$. $\overline{y}_i$ and $\overline{\tilde{y}}_i$ are the expected responses of the human individual and the digital individual, respectively. $y_i$ and $\tilde{y}_i$ are their corresponding noisy observations. The empirical mean of the individual noisy responses $\tilde{y}_i$ across the whole digital population is represented by the light purple point $\overline{\tilde{y}}$. The dark purple triangle $y_{ref}$ is the reference response generated by the LLM. The star $\overline{y}$ represents the average response of the sample population, adopted as a "ground truth". The five components $L_1$ to $L_5$ are explained in Theorem 2.
  • Figure 3: An example of the distribution of participants' profiles.
  • Figure 4: MAE with increasing simulated workers across training worker sizes. CrowdLLM (x%) means the model is trained with x% of the real human workers' data.
  • Figure 5: Resolution rate with increasing simulated workers in CrowdLLM under diverse and fixed profiles. CrowdLLM (x%) means the model is trained with x% of the real human workers' data.
  • ...and 21 more figures

Theorems & Definitions (9)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • proof
  • Lemma 1
  • proof