Are Large Language Models Consistent over Value-laden Questions?

Jared Moore, Tanvi Deshpande, Diyi Yang

TL;DR

The paper defines value consistency as the similarity of a model's answers across paraphrases, use-cases, translations, and related questions within a topic. On all four measures, it finds that models are relatively consistent.

Abstract

Large language models (LLMs) appear to bias their survey answers toward certain values. Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are they? To answer, we first define value consistency as the similarity of answers across (1) paraphrases of one question, (2) related questions under one topic, (3) multiple-choice and open-ended use-cases of one question, and (4) multilingual translations of a question to English, Chinese, German, and Japanese. We apply these measures to small and large, open LLMs including llama-3, as well as gpt-4o, using 8,000 questions spanning more than 300 topics. Unlike prior work, we find that models are relatively consistent across paraphrases, use-cases, translations, and within a topic. Still, some inconsistencies remain. Models are more consistent on uncontroversial topics (e.g., in the U.S., "Thanksgiving") than on controversial ones ("euthanasia"). Base models are both more consistent than fine-tuned models and uniform in their consistency across topics, while fine-tuned models, like our human subjects (n=165), are more inconsistent about some topics ("euthanasia") than others ("women's rights").

Paper Structure (54 sections, 10 equations, 27 figures, 13 tables)

Figures (27)

  • Figure 1: Similar to our human participants (n=84), chat models are inconsistent (change their answers) on topics like euthanasia and religious freedom but they are consistent on topics like women's rights and income inequality. This is less the case for base models like llama3-base. To measure such topic inconsistency, we prompted models with similar questions about a specific topic, measuring the distance between answers using a variant of the Jensen-Shannon divergence, the D-dimensional divergence (§\ref{sec:distance}). Shown here are the two topics with the highest and lowest topic inconsistency across models in English on U.S.-based topics; other languages and topics reported elsewhere.
  • Figure 2: Constructing ValueConsistency. We prompted gpt-4 to generate {un}controversial topics, questions, paraphrases, and translations for the U.S., China, Germany, and Japan in their respective dominant languages (§\ref{sec:dataset}). We then translated those data to {eng, chi, ger, jpn} also using gpt-4. This allows us to compare how consistent LLMs are on measures of topic, paraphrase, use-case, and multi-lingualism (§\ref{sec:method}, Tab. \ref{tab:measures}).
  • Figure 3: Unlike chat models and human participants, base models are uniformly consistent across topics. The x-axis shows each topic, ordered from least to most consistent, in English on U.S.-based topics. Each colored bar shows either the topic consistency (top plots) or paraphrase consistency (bottom plots). Both fine-tuned models and human participants (n=84 for topic, n=81 for paraphrase) show a greater spread than base models. Error bars show 95% bootstrapped confidence intervals. The dashed line shows the upper limit of .46 for our measure of inconsistency, the D-D divergence (§\ref{sec:distance}, §\ref{app:more-distance}).
  • Figure 4: Models are relatively consistent across our measures. They are as consistent as or more consistent than our human participants (n=81 for paraphrase and n=84 for topic consistency, §\ref{sec:experiment-setup}). In these plots we only compare topics for the U.S. in English (except in multilingual consistency, where we compare across up to all of {eng, chi, ger, jpn}). Error bars show 95% bootstrapped confidence intervals. The dashed line shows the upper limit of .46 for our measure of inconsistency, the D-D divergence (§\ref{sec:distance}, §\ref{app:more-distance}).
  • Figure 5: Chat models are more consistent over uncontroversial than controversial questions. Each plot shows a different model answering questions from a given country and language. The x-axis shows the paraphrase and topic inconsistency for each. Error bars show 95% bootstrapped confidence intervals.
  • ...and 22 more figures
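
The captions above describe inconsistency as a distance between answer distributions, computed with a variant of the Jensen-Shannon divergence taken over D distributions at once. The paper's exact definition and normalization are not reproduced on this page, so the following is only a minimal sketch of the standard equal-weight generalized Jensen-Shannon divergence (entropy of the mixture minus the mean entropy of the parts). The function names `entropy` and `generalized_jsd` are hypothetical, the choice of bits as the unit is an assumption, and the sketch does not attempt to reproduce the .46 upper limit mentioned in the captions.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def generalized_jsd(dists: Sequence[Sequence[float]]) -> float:
    """Equal-weight Jensen-Shannon divergence over D distributions.

    Assumption: this is the textbook generalization, not necessarily the
    exact variant used in the paper.
    """
    if not dists:
        raise ValueError("need at least one distribution")
    k = len(dists[0])
    if any(len(d) != k for d in dists):
        raise ValueError("all distributions must have the same length")
    # Mixture distribution: the per-outcome average over the D inputs.
    mixture = [sum(d[i] for d in dists) / len(dists) for i in range(k)]
    mean_entropy = sum(entropy(d) for d in dists) / len(dists)
    return entropy(mixture) - mean_entropy


# Hypothetical answer distributions over (support, oppose) for three
# paraphrases of one question; the numbers are illustrative only.
paraphrase_answers = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
inconsistency = generalized_jsd(paraphrase_answers)
```

Under this definition, identical distributions give a divergence of zero, and two distributions that put all their mass on different answers give one bit, so larger values correspond to a model changing its answer more across paraphrases or related questions.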