Are Frontier Large Language Models Suitable for Q&A in Science Centres?

Jacob Watson, Fabrício Góes, Marco Volpe, Talles Medeiros

TL;DR

This study evaluates frontier LLMs (GPT-4, Claude 3.5 Sonnet, Gemini 1.5) for Q&A in science centres, using real questions collected at the National Space Centre (NSC) and responses tailored to an 8-year-old audience. Responses were generated under standard and creative prompts and assessed by space science experts on accuracy, clarity, engagement, deviation from expected answers, and surprise. Claude consistently achieved higher accuracy, clarity, and engagement, while creative prompts increased novelty but often reduced factual reliability for GPT-4 and Gemini; standard prompts generally preserved accuracy. The work highlights the potential of LLMs in education while underscoring the need for careful prompt design, and possibly factual verification, before such systems are deployed in museums. One limitation is the lack of direct feedback from young visitors, which points to future work on adaptive prompting and broader evaluation through user studies.

Abstract

This paper investigates the suitability of frontier Large Language Models (LLMs) for Q&A interactions in science centres, with the aim of boosting visitor engagement while maintaining factual accuracy. Using a dataset of questions collected from the National Space Centre in Leicester (UK), we evaluated responses generated by three leading models: OpenAI's GPT-4, Claude 3.5 Sonnet, and Google Gemini 1.5. Each model was prompted for both standard and creative responses tailored to an 8-year-old audience, and these responses were assessed by space science experts based on accuracy, engagement, clarity, novelty, and deviation from expected answers. The results revealed a trade-off between creativity and accuracy, with Claude outperforming GPT-4 and Gemini in both maintaining clarity and engaging young audiences, even when asked to generate more creative responses. Nonetheless, experts observed that higher novelty was generally associated with reduced factual reliability across all models. This study highlights the potential of LLMs in educational settings, emphasizing the need for careful prompt engineering to balance engagement with scientific rigor.

Paper Structure

This paper contains 21 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Average scores assigned by science experts across five evaluation metrics: (a) accuracy, (b) clarity, (c) engagement, (d) deviation from expected answers, and (e) surprise. The panels compare the responses of GPT-4, Claude 3.5, and Gemini 1.5 under standard and creative prompts.
  • Figure 2: Average scores for responses to closed, open, divergent, and wildcard questions, aggregated for all LLMs and categorized by standard and creative prompts. Scores highlight differences in performance for the following metrics: (a) accuracy, (b) clarity, (c) engagement, (d) deviation from expected answers, and (e) surprise.
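
The per-model, per-prompt averages described in the figure captions could be computed along the following lines. This is only a minimal sketch, not the authors' analysis code: the record layout, the `average_scores` helper, the placeholder model names, and the 1 to 5 rating scale are all assumptions, and the numbers are made-up placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean

# The five metrics named in the figure captions.
METRICS = ("accuracy", "clarity", "engagement", "deviation", "surprise")

# Hypothetical expert ratings on an assumed 1-5 scale.
# These values are placeholders for illustration, NOT data from the paper.
ratings = [
    {"model": "model_a", "prompt": "standard",
     "accuracy": 5, "clarity": 4, "engagement": 3, "deviation": 1, "surprise": 2},
    {"model": "model_a", "prompt": "standard",
     "accuracy": 4, "clarity": 4, "engagement": 4, "deviation": 2, "surprise": 2},
    {"model": "model_a", "prompt": "creative",
     "accuracy": 3, "clarity": 4, "engagement": 5, "deviation": 4, "surprise": 5},
]


def average_scores(rows):
    """Group ratings by (model, prompt type) and average each metric."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["model"], row["prompt"])].append(row)
    return {
        key: {metric: mean(r[metric] for r in group) for metric in METRICS}
        for key, group in grouped.items()
    }


averages = average_scores(ratings)
for (model, prompt), scores in sorted(averages.items()):
    print(model, prompt, scores)
```

Grouping by question type (closed, open, divergent, wildcard) instead of by model would follow the same pattern, with the grouping key changed accordingly.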