Global-Liar: Factuality of LLMs over Time and Geographic Regions

Shujaat Mirza, Bruno Coelho, Yuyuan Cui, Christina Pöpper, Damon McCoy

TL;DR

This work interrogates the factuality and fairness of OpenAI GPT models across time and geographic regions using the Global-Liar dataset, a balanced corpus of 600 statements spanning six regions and pre/post-training-cutoff data. By employing multi-run prompts with a three-label verdict (true/false/unclear), zero-temperature settings, and rigorous metrics (stability via mode frequency and label switching; factuality via accuracy, precision, recall, F1, and Certainty Rate), the study reveals that GPT-4 March often outperforms other variants but newer versions (e.g., GPT-4 June) can underperform due to increased instances of

Abstract

The increasing reliance on AI-driven solutions, particularly Large Language Models (LLMs) like the GPT series, for information retrieval highlights the critical need for their factuality and fairness, especially amidst the rampant spread of misinformation and disinformation online. Our study evaluates the factual accuracy, stability, and biases in widely adopted GPT models, including GPT-3.5 and GPT-4, contributing to the reliability and integrity of AI-mediated information dissemination. We introduce 'Global-Liar,' a dataset uniquely balanced in terms of geographic and temporal representation, facilitating a more nuanced evaluation of LLM biases. Our analysis reveals that newer iterations of GPT models do not always equate to improved performance. Notably, the GPT-4 version from March demonstrates higher factual accuracy than its subsequent June release. Furthermore, a concerning bias is observed, privileging statements from the Global North over the Global South, thus potentially exacerbating existing informational inequities. Regions such as Africa and the Middle East are at a disadvantage, with much lower factual accuracy. The performance fluctuations over time suggest that model updates may not consistently benefit all regions equally. Our study also offers insights into the impact of various LLM configuration settings, such as binary decision forcing, model re-runs, and temperature, on model factuality. Models constrained to binary (true/false) choices exhibit reduced factuality compared to those allowing an 'unclear' option. Single inference at a low temperature setting matches the reliability of majority voting across various configurations. The insights gained highlight the need for culturally diverse and geographically inclusive model training and evaluation. This approach is key to achieving global equity in technology, distributing AI benefits fairly worldwide.
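To make the evaluation setup more concrete, the following is a minimal sketch of how repeated three-label verdicts (true/false/unclear) could be aggregated by majority voting and scored. It is not the authors' code. The function names, the tie-breaking rule, and the exact metric definitions are assumptions made for illustration: mode frequency is taken as the share of runs agreeing with the most common label, label switching as the number of adjacent runs that disagree, and Certainty Rate as the share of verdicts that are not 'unclear'. Accuracy counts an 'unclear' verdict as incorrect.

```python
from collections import Counter

# Tie-break order for majority voting (an assumption): prefer 'unclear' on ties.
TIE_ORDER = ("unclear", "true", "false")


def majority_vote(run_labels):
    """Return (mode label, mode frequency) for one statement's repeated runs."""
    counts = Counter(run_labels)
    # max() returns the first maximal element, so TIE_ORDER decides ties.
    mode = max(TIE_ORDER, key=lambda label: counts[label])
    return mode, counts[mode] / len(run_labels)


def label_switches(run_labels):
    """Count adjacent pairs of runs whose labels differ (assumed definition)."""
    return sum(a != b for a, b in zip(run_labels, run_labels[1:]))


def evaluate(predictions, gold):
    """Score aggregated verdicts against gold true/false labels.

    'unclear' never equals a gold label, so it is counted as incorrect.
    Certainty rate (assumed definition) is the share of non-'unclear' verdicts.
    """
    n = len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    certain = sum(p != "unclear" for p in predictions)
    return {"accuracy": correct / n, "certainty_rate": certain / n}


if __name__ == "__main__":
    # Hypothetical verdicts from five runs on three statements.
    runs = [
        ["true", "true", "false", "true", "true"],
        ["false", "unclear", "unclear", "unclear", "false"],
        ["false", "false", "false", "false", "false"],
    ]
    gold = ["true", "false", "true"]
    voted = [majority_vote(r)[0] for r in runs]
    print(voted, [label_switches(r) for r in runs], evaluate(voted, gold))
```

Under these assumed definitions, a single low-temperature inference can be compared with majority voting by scoring the first run of each statement with `evaluate` and comparing it against the voted labels.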

Paper Structure (39 sections, 3 equations, 12 figures, 9 tables)

Figures (12)

  • Figure 1: Performance metrics across different models for three main temperature values. Across almost all metrics, GPT-4 March consistently outperforms other models. The dataset consists of 300 label-balanced statements originating before the training cutoff date of Sep 2021. Results for other temperatures are provided in the Appendix.
  • Figure 2: Performance metrics across different models for temperature value 0. A marked decline in performance for post-cutoff period input is observed. Results for other temperatures are provided in the Appendix.
  • Figure 3: Performance metrics across different models and temperature values. The dataset consists of 300 label-balanced statements originating prior to the training cutoff date of Sep 2021.
  • Figure 4: Comparative performance of GPT-4 and GPT-3.5 models across varying temperature values, evaluated using accuracy. The dataset consists of 300 label-balanced statements originating prior to the training cutoff date of Sep 2021.
  • Figure 5: Comparative performance of GPT-4 and GPT-3.5 models across varying temperature values, evaluated using precision, recall, and F1 Score metrics. We treat statements with uncertain predictions as incorrect. The dataset consists of 300 label-balanced statements originating prior to the training cutoff date of Sep 2021. For comparison, performance excluding uncertain statements is reported in the Appendix.
  • ...and 7 more figures