Faithfulness metric fusion: Improving the evaluation of LLM trustworthiness across domains
Ben Malin, Tatiana Kalganova, Nikolaos Boulgouris
TL;DR
This work tackles faithfulness evaluation for LLM outputs by fusing multiple elementary metrics into a single score using an Explainable Boosting Machine. It constructs a homogenised cross-domain dataset containing human judgments and LLM outputs across QA, dialogue, and summarisation, and uses AMR graphs to enhance evaluation. The study shows that fusion improves correlation with human judgments across domains, with domain-tailored weightings highlighting the complementary strengths of LLM-based, n-gram, and graph-based metrics. The findings underscore the value of granular, interpretable metric fusion for trustworthy LLM deployment in diverse settings, including safety-critical and high-stakes applications.
Abstract
We present a methodology for improving the accuracy of faithfulness evaluation in Large Language Models (LLMs). The methodology combines elementary faithfulness metrics into a single fused metric, with the aim of improving the faithfulness of LLM outputs. The fusion strategy uses a tree-based model to identify the importance of each metric, guided by human judgements of the faithfulness of LLM responses. The fused metric is shown to correlate more strongly with human judgements of faithfulness across all tested domains. Improving the ability to evaluate the faithfulness of LLMs allows greater confidence to be placed in these models, enabling their use in a wider range of scenarios. Additionally, we homogenise a collection of datasets spanning question answering and dialogue-based domains and add human judgements and LLM responses to this dataset, allowing faithfulness evaluation to be reproduced and trialled across domains.
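To make the idea of tree-based metric fusion concrete, the following is a minimal sketch and not the authors' implementation. It assumes a much-simplified stand-in for an Explainable Boosting Machine: one-split trees (stumps) are boosted in round-robin order, one metric at a time, against human scores, so each metric ends up with its own additive contribution. The metric names (`llm_judge`, `ngram_overlap`, `amr_graph`), the synthetic data, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def fit_stump(values, residuals, max_candidates=16):
    """Find the single threshold on one metric that best fits the residuals."""
    cands = sorted(set(values))[:-1]  # drop the max so the right side is never empty
    step = max(1, len(cands) // max_candidates)
    best = None
    for t in cands[::step]:
        left = [r for v, r in zip(values, residuals) if v <= t]
        right = [r for v, r in zip(values, residuals) if v > t]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # constant metric: no split possible
        m = mean(residuals)
        return float("inf"), m, m
    return best[1], best[2], best[3]

def fit_fusion(X, y, rounds=30, lr=0.1):
    """Cyclic boosting: each round adds one shrunken stump per metric."""
    base = mean(y)
    pred = [base] * len(y)
    stumps = [[] for _ in X[0]]
    for _ in range(rounds):
        for j in range(len(stumps)):
            col = [row[j] for row in X]
            resid = [yi - pi for yi, pi in zip(y, pred)]
            t, lm, rm = fit_stump(col, resid)
            stumps[j].append((t, lr * lm, lr * rm))
            pred = [p + lr * (lm if v <= t else rm) for p, v in zip(pred, col)]
    return base, stumps

def contribution(model, j, value):
    """Additive contribution of metric j at a given metric value."""
    return sum(l if value <= t else r for t, l, r in model[1][j])

def predict(model, row):
    """Fused score: baseline plus the sum of per-metric contributions."""
    return model[0] + sum(contribution(model, j, v) for j, v in enumerate(row))

# Synthetic data (an assumption, not from the paper): each hypothetical metric
# is a noisy view of a latent faithfulness value; human scores are less noisy.
random.seed(0)
METRICS = ["llm_judge", "ngram_overlap", "amr_graph"]

def make_split(n):
    X, y = [], []
    for _ in range(n):
        f = random.random()
        X.append([f + random.gauss(0, 0.35) for _ in METRICS])
        y.append(f + random.gauss(0, 0.05))
    return X, y

X_train, y_train = make_split(200)
X_test, y_test = make_split(200)

model = fit_fusion(X_train, y_train)
fused_test = [predict(model, row) for row in X_test]

# Correlation with (synthetic) human scores on the held-out split.
single_r = {name: pearson([row[j] for row in X_test], y_test)
            for j, name in enumerate(METRICS)}
fused_r = pearson(fused_test, y_test)

# Rough per-metric importance: mean absolute deviation of its contribution.
importance = {}
for j, name in enumerate(METRICS):
    c = [contribution(model, j, row[j]) for row in X_train]
    importance[name] = mean(abs(ci - mean(c)) for ci in c)

for name in METRICS:
    print(f"{name}: r={single_r[name]:.3f}, importance={importance[name]:.3f}")
print(f"fused: r={fused_r:.3f}")
```

Because the fused score is a sum of per-metric terms, each metric's contribution can be inspected on its own, which is the property that makes this family of models interpretable. In practice one would likely use an established EBM implementation and real human annotations rather than this hand-rolled loop.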
