Building Understandable Messaging for Policy and Evidence Review (BUMPER) with AI

Katherine A. Rosenfeld, Maike Sonnewald, Sonia J. Jindal, Kevin A. McCarthy, Joshua L. Proctor

TL;DR

The paper addresses the challenge of translating diverse scientific evidence into policy while mitigating the trust and accountability issues that accompany large language models (LLMs). It introduces the Building Understandable Messaging for Policy and Evidence Review (BUMPER) framework, which formalizes a scientist-driven knowledge base $K$, actions $A=(a_0,\ldots,a_J)$, guidelines $G$, and topics $T$ to drive a six-step evidence-synthesis loop. A compliance score $S$, derived from token probabilities $P_T(G)$, separates in-scope conformity from factual correctness and keeps scientists in the loop. The contributions include formalizing ownership, establishing a transparent, scope-limited evaluation workflow, and prototyping code with a rugby case study and a measles health-policy case study to demonstrate practical translation. The framework aims to scale to independent data sources without fine-tuning, improving accessibility and confidence in evidence-informed policy while highlighting the need for validation and human collaboration in high-stakes settings.

Abstract

We introduce a framework for the use of large language models (LLMs) in Building Understandable Messaging for Policy and Evidence Review (BUMPER). LLMs are proving capable of providing interfaces for understanding and synthesizing large databases of diverse media. This presents an exciting opportunity to supercharge the translation of scientific evidence into policy and action, thereby improving livelihoods around the world. However, these models also pose challenges related to access, trustworthiness, and accountability. The BUMPER framework is built atop a scientific knowledge base (e.g., documentation, code, survey data) by the scientists who own it (e.g., individual contributor, lab, consortium). We focus on a solution that builds trustworthiness through transparency, scope-limiting, explicit checks, and uncertainty measures. LLMs are rapidly being adopted, and the consequences are poorly understood. The framework addresses open questions regarding the reliability of LLMs and their use in high-stakes applications. We provide a worked example in health policy for a model designed to inform measles control programs. We argue that this framework can facilitate accessibility of and confidence in scientific evidence for policymakers, drive a focus on policy-relevance and translatability for researchers, and ultimately increase and accelerate the impact of scientific knowledge used for policy decisions.

Paper Structure (11 sections, 6 figures, 3 tables)

This paper contains 11 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An overview of the BUMPER framework. See steps 1-5 in the framework section for details.
  • Figure 2: Comparing ChatGPT4 (with context) to a statistical model BUMPER. a) shows the BUMPER answer. b) contains a figure from the PyMC tutorial (coyle_pymc), which plots the modelled attack statistic for 4 Markov chains. c) shows answers from ChatGPT4 run with the tutorial as context (run on 5/20/2024).
  • Figure 3: Comparison of visual evidence to textual evidence from BUMPER.
  • Figure 4: Distributions of compliance scores/token probabilities ($S=P_0$) and overall result (no/yes) returned by the guidelines check. The check is computed against the entire set $G = (c_0, \ldots, c_N) \cup (t_0, \ldots, t_m)$ with no prompt for explanation (see the framework section for details and the prompts appendix for examples). The check is called three times ($N=3$) for each of 25 synthesized answers ($N=25$) for each of two fixed queries ($N=2$; one per row).
  • Figure 5: Distributions of compliance scores and overall result (no/yes) returned by the guidelines check. The check is called multiple times ($N=3$) for multiple synthesized answers ($N=25$) for a fixed query ($N=2$; row). Blue/green bars are the same as plotted in Figure 4. a) shows the scores with both the explanation-augmented prompt and individual assessments of each guideline element. b) shows, in gray/purple, the scores from an explanation-augmented prompt.
  • ...and 1 more figures
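
The Figure 4 caption describes a guidelines check that returns an overall result (no/yes) together with a compliance score equal to a token probability ($S=P_0$), repeated several times per synthesized answer. The Python sketch below illustrates one way such a score could be computed and summarized. It is not the authors' prototype code: it assumes the LLM API exposes the log-probability of the first answer token, and the names `CheckResult`, `score_check`, and `summarize`, as well as the numeric values, are hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class CheckResult:
    compliant: bool  # True when the guidelines check answered "yes"
    score: float     # S = P_0, probability of the first answer token


def score_check(first_token: str, logprob: float) -> CheckResult:
    """Turn the first answer token and its log-probability into (result, S).

    Assumption: the check is prompted to begin its reply with "yes" or "no".
    """
    token = first_token.strip().lower()
    if token not in ("yes", "no"):
        raise ValueError(f"unexpected first token: {first_token!r}")
    return CheckResult(compliant=(token == "yes"), score=math.exp(logprob))


def summarize(results: list[CheckResult]) -> dict[str, float]:
    """Summarize repeated checks of one synthesized answer."""
    if not results:
        raise ValueError("need at least one check result")
    return {
        "fraction_yes": sum(r.compliant for r in results) / len(results),
        "mean_score": mean(r.score for r in results),
        "min_score": min(r.score for r in results),
    }


# Three repeated checks of one synthesized answer (made-up log-probabilities).
checks = [
    score_check("Yes", math.log(0.98)),
    score_check("yes", math.log(0.91)),
    score_check("No", math.log(0.62)),
]
summary = summarize(checks)
```

Keeping the yes/no result and the score $S$ as separate fields mirrors the caption, which reports both: a low-probability "yes" can then be flagged for review by a scientist rather than treated the same as a confident one.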