Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report)

Hui Yin, Amir Aryani, Nakul Nambiar

TL;DR

This work evaluates open-source LLMs on Sustainable Development Goal (SDG) mapping against a GPT-4o baseline, applying one uniform prompt to 1,000 Swinburne publications and scoring the multi-label output with micro-averaged $F_1$, precision $P$, and recall $R$ at thresholds $t \in \{0.1, \dots, 0.9\}$. GPT-4o-mini, LLaMA 3, and Qwen2 perform strongly and produce similar curves, while LLaMA 2 and Gemma underperform. Performance peaks near $t \approx 0.5$ to $0.6$, and GPT-4o-mini stands out on recall. The results help researchers choose an open-source LLM for SDG mapping under privacy and resource constraints, and they show how the balance between precision and recall shifts with the threshold. All outputs and related data are openly available on Zenodo to enable replication and further benchmarking.

Abstract

The use of large language models (LLMs) is expanding rapidly, and open-source versions are becoming available, offering users safer and more adaptable options. These models let users protect data privacy, because no data has to be sent to third parties, and they can be customized for specific tasks. In this study, we compare the performance of several language models on the Sustainable Development Goal (SDG) mapping task, using the output of GPT-4o as the baseline. The open-source models selected for comparison are Mixtral, LLaMA 2, LLaMA 3, Gemma, and Qwen2. GPT-4o-mini, a smaller, lower-cost variant of GPT-4o, was also included to extend the comparison. Because SDG mapping is a multi-label task, we evaluate the models with micro-averaged F1 score, precision, and recall, all derived from the confusion matrix. We plot each metric against the decision threshold to compare the models across thresholds. In this experiment, LLaMA 2 and Gemma still have significant room for improvement, while the other four models do not differ greatly in performance. The outputs from all seven models are available on Zenodo: https://doi.org/10.5281/zenodo.12789375.
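The evaluation described above can be illustrated with a minimal sketch, which is not the authors' code. It assumes that each model returns, for every publication, a score in [0, 1] per SDG, that a label is predicted when its score is at least the threshold $t$, and that the GPT-4o baseline is available as a fixed set of SDG labels per publication. The data layout, the function names, and the toy values are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: SDG number -> score in [0, 1] for one publication.
Scores = Dict[int, float]


def labels_at_threshold(scores: Scores, t: float) -> Set[int]:
    """Keep every SDG whose score is at least t (assumed decision rule)."""
    return {sdg for sdg, score in scores.items() if score >= t}


def micro_prf(reference: List[Set[int]],
              predicted: List[Set[int]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1.

    Counts are pooled over all publications and labels before the ratios
    are taken, so frequent SDGs weigh more than rare ones.
    """
    tp = fp = fn = 0
    for ref, pred in zip(reference, predicted):
        tp += len(ref & pred)   # labels both sides agree on
        fp += len(pred - ref)   # predicted but absent from the baseline
        fn += len(ref - pred)   # in the baseline but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sweep(reference: List[Set[int]],
          model_scores: List[Scores],
          thresholds: List[float]) -> Dict[float, Tuple[float, float, float]]:
    """Evaluate one model at every threshold against the fixed baseline labels."""
    results = {}
    for t in thresholds:
        predicted = [labels_at_threshold(s, t) for s in model_scores]
        results[t] = micro_prf(reference, predicted)
    return results


# Toy data (invented for illustration): three publications.
reference = [{3, 13}, {7}, {4}]
candidate_scores = [
    {3: 0.9, 13: 0.4, 1: 0.2},
    {7: 0.6, 13: 0.55},
    {4: 0.3},
]
thresholds = [k / 10 for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9

results = sweep(reference, candidate_scores, thresholds)
for t, (p, r, f1) in results.items():
    print(f"t={t:.1f}  P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
```

On the toy data, raising the threshold removes low-scoring labels, so precision tends to rise while recall falls. This is the same precision-recall trade-off that the per-threshold curves are meant to expose.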

Paper Structure

This paper contains 4 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Comparison of F1 scores across different model thresholds.
  • Figure 2: Comparison of Precision and Recall across different model thresholds.