Measuring the Groundedness of Legal Question-Answering Systems

Dietrich Trautmann, Natalia Ostapuk, Quentin Grail, Adrian Alan Pol, Guglielmo Bonifazi, Shang Gao, Martin Gajek

TL;DR

This work benchmarks a range of methods for assessing the groundedness of AI-generated responses and demonstrates the potential of these detection methods to improve the trustworthiness of generative AI in legal settings.

Abstract

In high-stakes domains like legal question-answering, the accuracy and trustworthiness of generative AI systems are of paramount importance. This work presents a comprehensive benchmark of various methods to assess the groundedness of AI-generated responses, aiming to significantly enhance their reliability. Our experiments include similarity-based metrics and natural language inference models to evaluate whether responses are well-founded in the given contexts. We also explore different prompting strategies for large language models to improve the detection of ungrounded responses. We validated the effectiveness of these methods using a newly created grounding classification corpus, designed specifically for legal queries and corresponding responses from retrieval-augmented prompting, focusing on their alignment with source material. Our results indicate potential in groundedness classification of generated responses, with the best method achieving a macro-F1 score of 0.8. Additionally, we evaluated the methods in terms of their latency to determine their suitability for real-world applications, as this step typically follows the generation process. This capability is essential for processes that may trigger additional manual verification or automated response regeneration. In summary, this study demonstrates the potential of various detection methods to improve the trustworthiness of generative AI in legal settings.
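
To make the evaluation setup described in the abstract more concrete, the following is a minimal sketch of a similarity-based groundedness check scored with macro-F1. It is an illustration under stated assumptions, not the authors' implementation: the token-overlap score, the function names (`overlap_score`, `classify`, `macro_f1`), the label strings, and the 0.7 threshold are all hypothetical, and the abstract does not state which similarity metrics were actually used.

```python
from collections import Counter


def overlap_score(response: str, context: str) -> float:
    """Fraction of response tokens that also occur in the context.

    A toy lexical-overlap baseline (an assumption for illustration only).
    """
    resp_tokens = response.lower().split()
    if not resp_tokens:
        return 0.0
    ctx_counts = Counter(context.lower().split())
    hits = sum(1 for tok in resp_tokens if ctx_counts[tok] > 0)
    return hits / len(resp_tokens)


def classify(response: str, context: str, threshold: float = 0.7) -> str:
    """Label a response as grounded if enough of its tokens appear in the context.

    The threshold value is hypothetical and would need tuning on held-out data.
    """
    score = overlap_score(response, context)
    return "grounded" if score >= threshold else "ungrounded"


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    gold = ["grounded", "grounded", "ungrounded", "ungrounded"]
    pred = ["grounded", "ungrounded", "ungrounded", "ungrounded"]
    print(f"macro-F1: {macro_f1(gold, pred):.3f}")
```

Macro-F1 averages the per-class F1 scores without weighting by class frequency, so the grounded and ungrounded classes count equally even when one of them is much rarer in the corpus.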

Paper Structure

This paper contains 33 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Example query and corresponding LLM responses with grounded and erroneous spans (Procedural Errors). The retrieved context used for grounding the responses was omitted due to its length. The remaining sentences in both responses are identical and grounded; they are left unhighlighted to emphasize the differences.
  • Figure 2: Development set results for our benchmark. We report the F1-scores (y-axis) for each method and the corresponding latency (x-axis) in seconds per response. Approach names marked with * were run on an AWS ml.8xlarge instance.
  • Figure 3: Counts of unique error types in the development set. Some responses contained up to three different error types. The frequency axis is in log-scale.
  • Figure 4: Counts of response error types in the development set. The frequency axis is in log-scale.