Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models

Sreejato Chatterjee, Linh Tran, Quoc Duy Nguyen, Roni Kirson, Drue Hamlin, Harvest Aquino, Hanjia Lyu, Jiebo Luo, Timothy Dye

TL;DR

The paper tackles the challenge of measuring historical structural oppression across diverse countries by using rule-guided prompting of large language models (LLMs) to derive context-sensitive oppression scores from free-text, self-identified ethnicity data. It builds a bottom-up five-level oppression schema from multilingual responses and formalizes a rule-guided prompting framework to ensure historically grounded, cross-context scoring. Evaluations across Gemini 1.5 Pro, GPT-3.5 Turbo, and GPT-4o mini show that rule-guided prompting yields strong alignment with human expert annotations (e.g., $r=0.852$, $\rho=0.844$ for the best model) and outperforms vanilla and chain-of-thought baselines. The authors release the open HSO-Bench dataset to standardize evaluation and discuss limitations and future directions for calibration, regional variability, and retrieval-augmented approaches. Overall, the work demonstrates that with principled prompting, LLMs can serve as scalable tools to capture identity-based historical oppression in data-driven research and public health contexts.

Abstract

Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and often have relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts. To support reproducible evaluation, we release an open-sourced benchmark dataset for assessing LLMs on oppression measurement (https://github.com/chattergpt/HSO-Bench).

Paper Structure

This paper contains 22 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Illustration of the prompting setup used for our experiment, comparing Rule-Guided Prompting (RG) with two baseline methods: Vanilla and Chain-of-Thought (CoT). Each prompt is constructed sequentially as follows: (1) system role $\rightarrow$ (2) identity statement $\rightarrow$ (3) instruction $\rightarrow$ (4) oppression schema $\rightarrow$ (5) optional CoT or RG instruction $\rightarrow$ (6) required output format. The complete set of rules for the RG method is provided in the section on rule-guided prompting, and the set of guiding questions for the CoT method is detailed in the section on baselines.
  • Figure 2: Pearson correlations between LLM predictions and human annotations across countries. Brazil has the highest alignment at $r = 0.86$, while Algeria and Madagascar have the lowest alignment at around $r = 0.5$.
  • Figure 3: Distribution of absolute differences between LLM-predicted oppression scores and human expert annotations across 334 identity-country pairs, grouped by model. GPT-3.5-Turbo produces the highest number of high-magnitude errors ($|\text{difference}| \geq 2$), while GPT-4o-mini produces the fewest deviations.
  • Figure 4: Distribution of absolute differences between LLM-predicted oppression scores and human expert annotations across 334 identity-country pairs, grouped by prompt method. Chain-of-Thought (CoT) prompting produces the highest number of high-magnitude errors ($|\text{difference}| \geq 2$), while Rule-Guided prompting produces the fewest deviations.
  • Figure 5: Highest divergences between human and LLM ratings across all models and prompt strategies. The majority of divergences either involve identities in Algeria or come from scores generated by CoT prompting or GPT-3.5-Turbo.
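
The prompt ordering described for Figure 1 and the error measure used in Figures 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the blank-line separator between components, and the placeholder strings in the usage note are assumptions made for illustration, and the actual prompt wording is not reproduced here. The only details taken from the captions are the six-step component order, the optional fifth component, and the $|\text{difference}| \geq 2$ threshold for high-magnitude errors.

```python
from typing import List, Optional, Sequence


def build_prompt(
    system_role: str,
    identity_statement: str,
    instruction: str,
    oppression_schema: str,
    output_format: str,
    extra_instruction: Optional[str] = None,
) -> str:
    """Assemble a prompt in the order given in the Figure 1 caption.

    extra_instruction is the optional CoT or RG instruction (step 5);
    leaving it as None corresponds to the Vanilla baseline.
    Joining components with blank lines is an assumption of this sketch.
    """
    parts = [system_role, identity_statement, instruction, oppression_schema]
    if extra_instruction:
        parts.append(extra_instruction)
    parts.append(output_format)
    return "\n\n".join(parts)


def absolute_differences(
    predicted: Sequence[int], human: Sequence[int]
) -> List[int]:
    """Per-item |LLM score - human score| for aligned identity-country pairs."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    return [abs(p - h) for p, h in zip(predicted, human)]


def count_high_magnitude(diffs: Sequence[int], threshold: int = 2) -> int:
    """Count errors with |difference| >= threshold (2 in Figures 3 and 4)."""
    return sum(1 for d in diffs if d >= threshold)
```

For example, calling `build_prompt("<system role>", "<identity statement>", "<instruction>", "<schema>", "<output format>", extra_instruction="<RG rules>")` yields a string in which the rule-guided instruction sits between the schema and the output format, and `count_high_magnitude(absolute_differences(llm_scores, expert_scores))` gives the number of high-magnitude errors for one model or prompt method, where `llm_scores` and `expert_scores` are hypothetical aligned lists of integer scores.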