UQLM: A Python Package for Uncertainty Quantification in Large Language Models
Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Ho-Kyeong Ra, Viren Bajaj, Zeya Ahmad
TL;DR
This work addresses LLM hallucinations by introducing UQLM, a Python toolkit that unifies generation-time uncertainty quantification across four scorer families (Black-Box UQ, White-Box UQ, LLM-as-a-Judge, and ensembles) to produce response-level confidence scores in the range $[0,1]$ without ground-truth data. It provides an integrated generate_and_score workflow and supports ensemble tuning to optimize reliability. The contributions include an open-source implementation, integration with LangChain, and practical guidance for tuning and deployment. Overall, UQLM lowers the barrier to hallucination detection and supports more trustworthy LLM deployments across domains.
Abstract
Hallucinations, defined as instances where Large Language Models (LLMs) generate false or misleading content, pose a significant challenge that impacts the safety and trust of downstream applications. We introduce UQLM, a Python package for LLM hallucination detection using state-of-the-art uncertainty quantification (UQ) techniques. This toolkit offers a suite of UQ-based scorers that compute response-level confidence scores ranging from 0 to 1. This library provides an off-the-shelf solution for UQ-based hallucination detection that can be easily integrated to enhance the reliability of LLM outputs.
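To make the idea of a response-level confidence score concrete, the following is a minimal sketch of one simple consistency-style score: the fraction of resampled responses that agree with the original response after light normalization. It does not use UQLM itself and is not the package's API. The function names `exact_match_confidence` and `flag_hallucination`, the normalization rule, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_confidence(original: str, samples: List[str]) -> float:
    """Return the share of sampled responses that match the original.

    Hypothetical helper for illustration only. The result lies in [0, 1]:
    1.0 means every sample agrees with the original response, 0.0 means none do.
    """
    if not samples:
        raise ValueError("need at least one sampled response")
    target = _normalize(original)
    matches = sum(1 for s in samples if _normalize(s) == target)
    return matches / len(samples)


def flag_hallucination(confidence: float, threshold: float = 0.5) -> bool:
    """Flag a response as a likely hallucination when confidence is below the threshold.

    The threshold value is an assumption, not a recommended setting.
    """
    return confidence < threshold


# Example: one original answer and four resampled answers to the same prompt.
score = exact_match_confidence("Paris", ["paris", " Paris ", "Lyon", "PARIS"])
print(score, flag_hallucination(score))
```

Because the score needs only the generated text, it requires no ground-truth labels, which is the property the abstract emphasizes. Exact matching suits only short, closed-form answers; longer responses would need a softer similarity measure.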
