Report-based Recommendations for Policy Making and Agency Operations: Dataset and LLM Evaluation

Aleksandra Edwards; Thomas Edwards; Jose Camacho-Collados; Alun Preece

Abstract

Large Language Models (LLMs) are extensively used in text generation tasks. These generative capabilities bring us to a point where LLMs could potentially provide useful insights in policy making or agency operations. In this paper, we introduce a new task consisting of generating recommendations which can be used to inform future actions and improvements of agencies' work within private and public organisations. In particular, we present the first benchmark and coherent evaluation for developing recommendation systems to inform organisation policies. This task is clearly different from usual product or user recommendation systems; it instead aims to provide a basis for suggesting policy improvements based on the conclusions drawn from reports. Our results demonstrate that state-of-the-art LLMs have the potential to emphasize and reflect on key issues and learning points within generated recommendations.

Paper Structure (21 sections, 5 figures, 7 tables)

Introduction
Related Work
PubRec-Bench: Recommendation Generation Benchmark
Task Description
Dataset Collection and Unification
Data Statistics
Experimental Setting
Recommendation Generation
Evaluation
Results and Analysis
Automatic Evaluation
Human Evaluation
Correlation Analysis
Discussion
Conclusions
...and 6 more sections

Figures (5)

Figure 1: An overview of the recommendation generation pipeline.
Figure 2: Comparison of LLM-based evaluations ('eval') in zero-shot settings (left) and one-shot settings (right) for recommendations generated by each model across the three datasets.
Figure 3: Spearman's rank correlation (left) and p-values (right) between manual evaluation and automated metrics-based evaluation across the three datasets, where 'eval' refers to evaluation, and 'Care Homes', 'US Children Bureau' and 'NSPCC reports' refer to the results from the human-based evaluation for the Care Homes, US Children Bureau, and NSPCC datasets, respectively.
Figure 4: Spearman's rank correlation across the criteria for the manual evaluation, where 'Rel. to evidence' refers to Relevance to the evidence and 'Rel. to human rec.' refers to Relevance to the human-created recommendation.
Figure 5: Instructions for human evaluation.
