UQLM: A Python Package for Uncertainty Quantification in Large Language Models

Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad

TL;DR

This work tackles the problem of LLM hallucinations by introducing UQLM, a Python toolkit that unifies generation-time uncertainty quantification across four scorer families—Black-Box UQ, White-Box UQ, LLM-as-a-Judge, and ensembles—to produce response-level confidence scores in the range $[0,1]$ without ground-truth data. It provides an integrated generate_and_score workflow, supports multiple scoring strategies, and enables ensemble tuning to optimize reliability. The contributions include a comprehensive, open-source implementation, flexible integration with LangChain, and practical guidance for tuning and deployment to improve AI safety in real-world applications. Overall, UQLM lowers barriers to robust hallucination detection and facilitates safer, more trustworthy LLM deployments across domains.

Abstract

Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
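To make the idea of a response-level confidence score in $[0,1]$ concrete, here is a minimal sketch of one black-box style consistency score and a simple ensemble. It is not UQLM's API: the function names `exact_match_confidence` and `ensemble_confidence` are hypothetical, the exact-match rate is used only as a simple stand-in for a black-box consistency scorer, and the weighted average is an assumed form of ensembling, with weights that would in practice be tuned rather than set by hand.

```python
from typing import Sequence


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match_confidence(original: str, samples: Sequence[str]) -> float:
    """Hypothetical black-box score: share of sampled responses that agree
    with the original response after normalization. Returns a value in [0, 1]."""
    if not samples:
        raise ValueError("need at least one sampled response")
    target = _normalize(original)
    matches = sum(_normalize(s) == target for s in samples)
    return matches / len(samples)


def ensemble_confidence(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Hypothetical ensemble: weighted average of component scores.
    Assumes every score is already in [0, 1] and weights are non-negative."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


# Usage: three of four resampled answers agree with the original answer.
consistency = exact_match_confidence("Paris", ["paris", "Paris ", "Lyon", "Paris"])
# Combine with a second, made-up component score (e.g. from a judge model).
combined = ensemble_confidence([consistency, 0.25], weights=[3, 1])
```

No ground-truth answer appears anywhere in the computation: the score depends only on how consistently the model answers the same prompt, which is what allows this kind of scoring at generation time.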

Paper Structure

This paper contains 9 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Illustration of a Black-Box Scorer Workflow
  • Figure 2: Illustration of a Single-Generation White-Box Scorer Workflow
  • Figure 3: Illustration of LLM-as-a-Judge Workflow
  • Figure 4: Illustration of Ensemble Tuning