Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives

Yinuo Xu, Veronica Derricks, Allison Earl, David Jurgens

TL;DR

The paper presents DeM-MoE, a demographic-aware mixture of experts that routes inputs to experts based on annotator demographics to model structured disagreement in subjective NLP tasks. It demonstrates that this inductive bias yields robust performance across diverse datasets with varying levels of annotator disagreement, outperforming several baselines, especially on low-agreement tasks. It also investigates zero-shot LLM-generated synthetic annotations and develops strategies for blending real and synthetic data, showing dataset-dependent gains and highlighting the need to tailor augmentation to dataset structure. Overall, the work advances perspective-aware learning by combining architecture and data-centric techniques to better represent diverse annotator viewpoints. The approach offers practical pathways to scale nuanced, demographic-aligned ratings while acknowledging limitations and ethical considerations.

Abstract

We present an approach to modeling annotator disagreement in subjective NLP tasks through both architectural and data-centric innovations. Our model, DEM-MoE (Demographic-Aware Mixture of Experts), routes inputs to expert subnetworks based on annotator demographics, enabling it to better represent structured, group-level variation compared to prior models. DEM-MoE consistently performs competitively across demographic groups, and shows especially strong results on datasets with high annotator disagreement. To address sparse demographic coverage, we test whether LLM-generated synthetic annotations via zero-shot persona prompting can be used for data imputation. We show these synthetic judgments align moderately well with human annotations on our data and offer a scalable way to potentially enrich training data. We then propose and evaluate strategies for blending real and synthetic data, and find that the optimal strategy depends on dataset structure. Together, these contributions improve the representation of diverse perspectives.

Paper Structure

This paper contains 48 sections, 1 equation, 22 figures, and 28 tables.

Figures (22)

  • Figure 1: Comparison of mean MAE across demographics for all datasets (lower MAE is better). We obtain the mean and error bars from bootstrap samples. The star (*) above our MoE model indicates that it is statistically better ($p < 0.05$) than the next-best model. The circle (o) above MoE indicates that it is statistically equivalent to the best model.
  • Figure 2: Mean pairwise KL diversity in expert usage distributions across subgroups for each demographic (higher KL shows more specialization).
  • Figure 3: Mean MAE across demographic categories by training strategy and synthetic-data generation method (lower is better), shown for the three datasets. The purple horizontal line is the MAE of DeM-MoE trained only on real data (see Experiment 1) with 95% confidence intervals. The shaded regions denote data generation methods.
  • Figure 4: Screenshot of our questions.
  • Figure 5: The architecture of our DeM-MoE model.
  • ...and 17 more figures
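
The Figure 2 caption refers to a mean pairwise KL diversity over expert usage distributions. The listing here does not give the paper's exact definition, so the sketch below is only one plausible reading: it assumes each subgroup has a probability vector over experts, averages KL(p || q) over all ordered pairs of distinct subgroups, and adds a small smoothing constant. The function names, the `eps` value, and the example numbers are all hypothetical.

```python
import math
from itertools import permutations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same experts.

    eps is an assumed smoothing constant that avoids log(0); it is not taken
    from the paper.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_pairwise_kl(usage_by_subgroup):
    """Average KL over all ordered pairs of distinct subgroups (assumed definition)."""
    pairs = list(permutations(usage_by_subgroup, 2))
    if not pairs:
        return 0.0
    total = sum(
        kl_divergence(usage_by_subgroup[a], usage_by_subgroup[b])
        for a, b in pairs
    )
    return total / len(pairs)


# Hypothetical expert-usage distributions for three subgroups of one demographic.
usage = {
    "group_a": [0.7, 0.2, 0.1],
    "group_b": [0.1, 0.2, 0.7],
    "group_c": [0.4, 0.3, 0.3],
}
diversity = mean_pairwise_kl(usage)
```

Under this reading, a value near zero means the subgroups use the experts in nearly the same proportions, and larger values mean subgroups are routed to different experts, which matches the caption's note that higher KL shows more specialization.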