Report-based Recommendations for Policy Making and Agency Operations: Dataset and LLM Evaluation

Aleksandra Edwards, Thomas Edwards, Jose Camacho-Collados, Alun Preece

Abstract

Large Language Models (LLMs) are extensively used in text generation tasks. These generative capabilities bring us to a point where LLMs could potentially provide useful insights in policy making or agency operations. In this paper, we introduce a new task: generating recommendations that can inform future actions and improvements to agencies' work within private and public organisations. In particular, we present the first benchmark and coherent evaluation for developing recommendation systems to inform organisation policies. This task differs from the usual product or user recommendation systems: it aims to provide a basis for suggesting policy improvements based on the conclusions drawn from reports. Our results demonstrate that state-of-the-art LLMs have the potential to emphasize and reflect on key issues and learning points within generated recommendations.

Paper Structure (21 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: An overview of the recommendation generation pipeline.
  • Figure 2: Comparison of LLM-based evaluations ('eval') in zero-shot settings (left) and one-shot settings (right) for recommendations generated by each model across the three datasets.
  • Figure 3: Spearman's rank correlation (left) and p-values (right) between manual evaluation and automated metrics-based evaluation across the three datasets, where 'eval' refers to evaluation, and 'Care Homes', 'US Children Bureau' and 'NSPCC reports' refer to the human-based evaluation results for the Care Homes, US Children Bureau, and NSPCC datasets, respectively.
  • Figure 4: Spearman's rank correlation across the criteria for the manual evaluation, where 'Rel. to evidence' refers to Relevance to the evidence and 'Rel. to human rec.' refers to Relevance to the human-created recommendation.
  • Figure 5: Instructions for human evaluation.