Towards A Cultural Intelligence and Values Inferences Quality Benchmark for Community Values and Common Knowledge
Brittany Johnson, Erin Reddick, Angela D. R. Smith
TL;DR
This work addresses the cultural misalignment of large language models by proposing CIVIQ, a U.S. Black-community-centered benchmark inspired by KorNAT, to evaluate LLM alignment with community social values and common knowledge. It outlines a four-part methodology: curating social values topics from timely and conflict-related keywords; generating and validating items with LLMs and human editors; developing common knowledge items with expert input and ChatBlackGPT; and deploying a large-scale stratified survey scored for social value alignment (SVA) and common knowledge alignment (CKA). The paper emphasizes ethical governance through IRB oversight, a community data covenant, and controlled data sharing, with the aim of enabling trustworthy, culturally aware AI tooling for diverse software teams. If successful, CIVIQ could encourage industry adoption of culturally intelligent AI and foster collaborations across academia, industry, and civil society to mitigate biases in AI systems.
Abstract
Large language models (LLMs) have emerged as a powerful technology and have seen widespread adoption by software engineering teams. Most often, LLMs are designed as "general purpose" technologies meant to represent the general population. Unfortunately, this often means alignment with predominantly Western Caucasian narratives and misalignment with other cultures and populations that engage in collaborative innovation. In response to this misalignment, recent efforts have centered on developing "culturally-informed" LLMs, such as ChatBlackGPT, that better align with historically marginalized experiences and perspectives. Despite this progress, little effort has been aimed at supporting our ability to develop and evaluate culturally-informed LLMs. A recent effort proposed an approach for developing a national alignment benchmark that emphasizes alignment with national social values and common knowledge. However, given the range of cultural identities present in the United States (U.S.), a single national alignment benchmark is ill-suited to broader representation. To help fill this gap in the U.S. context, we propose a replication study that translates the process used to develop KorNAT, a Korean national LLM alignment benchmark, to develop CIVIQ, a Cultural Intelligence and Values Inference Quality benchmark centered on alignment with community social values and common knowledge. Our work provides a critical foundation for research and development aimed at cultural alignment of AI technologies in practice.
