Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)

Krishnaram Kenthapadi; Mehrnoosh Sameki; Ankur Taly

TL;DR

This survey addresses the trust, safety, and observability challenges of large language models (LLMs) in high-stakes settings. It covers a broad set of evaluation dimensions (truthfulness, safety and alignment, bias, robustness, privacy, calibration, and transparency) and presents grounding and operationalization strategies, including Retrieval Augmented Generation, constrained decoding, guardrails, and LLM operations. For each dimension, the article provides a structured taxonomy of problem statements, solution approaches, and open challenges, along with practical guidance and industry lessons for deploying robust, grounded, and auditable LLM systems. By connecting evaluation, grounding, and observability, it highlights how to reduce harm, improve accountability, and enable scalable, responsible use of generative AI in real-world applications, and it outlines key research directions for future work. Together, the tutorial and survey aim to help researchers and practitioners build safer, more reliable LLM-based systems whose governance and trustworthiness can be measured.

Abstract

With the ongoing rapid adoption of Artificial Intelligence (AI)-based systems in high-stakes domains, ensuring the trustworthiness, safety, and observability of these systems has become crucial. It is essential to evaluate and monitor AI systems not only for accuracy and quality-related metrics but also for robustness, bias, security, interpretability, and other responsible AI dimensions. We focus on large language models (LLMs) and other generative AI models, which present additional challenges such as hallucinations, harmful and manipulative content, and copyright infringement. In this survey article accompanying our KDD 2024 tutorial, we highlight a wide range of harms associated with generative AI systems, and survey state-of-the-art approaches (along with open challenges) to address these harms.
