On the Reliability of Large Language Models to Misinformed and Demographically-Informed Prompts

Toluwani Aremu, Oluwakemi Akinwehinmi, Chukwuemeka Nwagu, Syed Ishtiaque Ahmed, Rita Orji, Pedro Arnau Del Amo, Abdulmotaleb El Saddik

TL;DR

The authors conclude that while these chatbots hold significant promise, deploying them in sensitive areas requires careful consideration, ethical oversight, and rigorous refinement so that they augment human expertise rather than act as an autonomous solution.

Abstract

We investigate the behaviour and performance of Large Language Model (LLM)-backed chatbots in addressing misinformed prompts and questions with demographic information within the domains of Climate Change and Mental Health. Through a combination of quantitative and qualitative methods, we assess the chatbots' ability to discern the veracity of statements, their adherence to facts, and the presence of bias or misinformation in their responses. Our quantitative analysis using True/False questions reveals that these chatbots can be relied on to give the right answers to these closed-ended questions. However, the qualitative insights, gathered from domain experts, show that there are still concerns regarding privacy, ethical implications, and the necessity for chatbots to direct users to professional services. We conclude that while these chatbots hold significant promise, their deployment in sensitive areas necessitates careful consideration, ethical oversight, and rigorous refinement to ensure they serve as a beneficial augmentation to human expertise rather than an autonomous solution.

Paper Structure

This paper contains 28 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Distribution of domain experts in the fields of Climate Change and Mental Health, categorized by location and profession. In both domains, Asia and Africa provide the largest regional share of experts, while North America and Oceania contribute the least. The majority of experts in both domains are from Academic & Research backgrounds, accounting for 57% of the total, as opposed to 43% from Industry.
  • Figure 2: Confusion matrices showing how the selected chatbots performed when judging whether a fact stated in a prompt is true or false, for the Climate Change and Mental Health domains. For Climate Change, there were 1,368 true negatives and 1,436 true positives, with 188 false positives and 127 false negatives. For Mental Health, the chatbots produced 1,253 true negatives and 1,301 true positives, with fewer errors: 143 false positives and 65 false negatives. Pooled over these counts, roughly 90% of Climate Change answers and 92% of Mental Health answers were correct, which indicates a high level of accuracy in the chatbots' knowledge across both domains.
  • Figure 3: Bar charts of the Similarity Index Scores for three LLM-powered chatbots (ChatGPT (GPT-3.5), Bard (LaMDA), and Bing (GPT-4)) across three evaluation metrics: BLEU, ROUGE, and METEOR.
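
The counts in the Figure 2 caption are enough to recompute the usual summary metrics. The sketch below is illustrative only, not the authors' evaluation code: the `ConfusionCounts` class and the `FIGURE_2` dictionary are hypothetical names introduced here, and it assumes that "positive" means a prompt whose stated fact is true, which the caption does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cell counts of a 2x2 confusion matrix (hypothetical helper)."""
    tn: int  # true negatives
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives

    @property
    def total(self) -> int:
        return self.tn + self.tp + self.fp + self.fn

    def accuracy(self) -> float:
        # Share of all answers that were correct.
        return (self.tp + self.tn) / self.total

    def precision(self) -> float:
        # Of the answers labelled positive, the share that were right.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Of the truly positive items, the share that were found.
        return self.tp / (self.tp + self.fn)

    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)


# Counts copied from the Figure 2 caption; which class counts as
# "positive" is an assumption (see the lead-in).
FIGURE_2 = {
    "Climate Change": ConfusionCounts(tn=1368, tp=1436, fp=188, fn=127),
    "Mental Health": ConfusionCounts(tn=1253, tp=1301, fp=143, fn=65),
}

if __name__ == "__main__":
    for domain, c in FIGURE_2.items():
        print(
            f"{domain}: n={c.total} "
            f"accuracy={c.accuracy():.3f} "
            f"precision={c.precision():.3f} "
            f"recall={c.recall():.3f} "
            f"f1={c.f1():.3f}"
        )
```

Under these assumptions, accuracy comes out at about 0.90 for Climate Change and about 0.92 for Mental Health. Because the counts are pooled, these figures say nothing about how any individual chatbot performed.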