Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction

Brian Freeman; Adam Kicklighter; Matt Erdman; Zach Gordon

TL;DR

We present and compare five prompt engineering strategies that aim to reduce the variance of model outputs and move toward repeatable, grounded results, without modifying model weights or building complex validation models.

Abstract

Hallucinations in large language models (LLMs) are outputs that are syntactically coherent but factually incorrect or contextually inconsistent. They are persistent obstacles in high-stakes industrial settings such as engineering design, enterprise resource planning, and IoT telemetry platforms. We present and compare five prompt engineering strategies intended to reduce the variance of model outputs and move toward repeatable, grounded results without modifying model weights or creating complex validation models. The methods are: (M1) Iterative Similarity Convergence, (M2) Decomposed Model-Agnostic Prompting, (M3) Single-Task Agent Specialization, (M4) Enhanced Data Registry, and (M5) Domain Glossary Injection. Each method is evaluated against an internal baseline using an LLM-as-Judge framework over 100 repeated runs per method (same fixed task prompt, stochastic decoding at $\tau = 0.7$). Under this evaluation setup, M4 (Enhanced Data Registry) received ``Better'' verdicts in all 100 trials; M3 and M5 reached 80\% and 77\% respectively; M1 reached 75\%; and M2 was net negative at 34\% when compared to single-shot prompting with a modern foundation model. We then developed enhanced version 2 (v2) implementations and assessed them on a 10-trial verification batch; M2 recovered from 34\% to 80\%, the largest gain among the four revised methods. We discuss how these strategies help overcome the non-deterministic nature of LLM results for industrial procedures, even when absolute correctness cannot be guaranteed. We provide pseudocode, verbatim prompts, and batch logs to support independent assessment.
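As a reading aid, the repeated-trial protocol described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_baseline`, `run_enhanced`, and `judge` are hypothetical callables standing in for the model calls, and the only verdict label confirmed by the text is ``Better''.

```python
from collections import Counter
from typing import Callable


def better_rate(
    run_baseline: Callable[[str], str],
    run_enhanced: Callable[[str], str],
    judge: Callable[[str, str, str], str],
    task_prompt: str,
    n_trials: int = 100,
) -> tuple[float, Counter]:
    """Repeat one fixed task prompt and tally the judge's verdicts.

    All three callables are hypothetical placeholders: the first two would
    sample the model at a nonzero temperature, and the judge would return a
    verdict string such as "Better" for the enhanced response.
    """
    verdicts: Counter = Counter()
    for _ in range(n_trials):
        # Baseline and enhanced responses are generated independently per trial.
        baseline = run_baseline(task_prompt)
        enhanced = run_enhanced(task_prompt)
        verdicts[judge(task_prompt, baseline, enhanced)] += 1
    # Fraction of trials in which the enhanced response was judged "Better".
    return verdicts["Better"] / n_trials, verdicts
```

With this tally, a method judged ``Better'' in 34 of 100 trials scores 0.34, and a revised version scoring 0.80 corresponds to a gain of 46 percentage points.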

Paper Structure (58 sections, 6 equations, 7 figures, 5 tables, 9 algorithms)


Introduction
Temperature and Sampling Variance
Hallucinations in Large Language Models
Error in Statistical Machine Learning Systems
Handling Errors
Application-Specific Failure Modes
Contributions and Scope
Epistemic Stability: Moving Toward Operable Certainty
Machine Learning Context
Scope of the Epistemic Certainty Claim
Related Work
Hallucination Taxonomy and Surveys
Self-Consistency and Iterative Prompting
Chain-of-Thought and Decomposition
Self-Critique and Reflection
...and 43 more sections

Figures (7)

Figure 1: Taxonomy of the five hallucination-reduction strategies and the root cause each addresses. M1 and M2 target prompt reasoning and structure; M3 targets agent architecture; M4 and M5 target input data quality.
Figure 2: Iterative convergence profile for M1 across three representative trials. Diamonds mark the iteration reaching $\sigma_{\text{sim}} = 0.85$.
Figure 3: LLM-as-Judge evaluation pipeline. Each trial generates a method-specific baseline and an enhanced response independently, then a zero-temperature judge produces dimension-level scores and an aggregate verdict.
Figure 4: D1 baseline 100-trial results. All five methods shown. M4 received "Better" in all 100 trials under this judge rubric; M2 was net negative.
Figure 5: D2 verification results: 10-trial batch. M1 v2, M3 v2, and M4 each received "Better" in all 10 trials; these should be treated as provisional. M2 v2 shows the largest single-method gain over its v1 baseline (+46 points). M5 v2 is discussed in Section \ref{subsec:m5disc}.
...and 2 more figures
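The stopping rule suggested by the Figure 2 caption (iterate until a similarity score reaches 0.85) can be sketched as below. This is only an illustration under assumptions: the similarity measure used for M1 is not stated in this excerpt, so the standard-library `difflib.SequenceMatcher` ratio is swapped in as a stand-in, and `generate`, `threshold`, and `max_iters` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def first_converged_iteration(
    generate: Callable[[str], str],
    prompt: str,
    threshold: float = 0.85,
    max_iters: int = 10,
) -> Optional[int]:
    """Return the 1-based iteration at which consecutive outputs first agree.

    Assumption: similarity is a character-level SequenceMatcher ratio between
    the current and previous outputs; the method itself may use a different
    measure (for example, embedding similarity).
    """
    previous = generate(prompt)  # iteration 1 has nothing to compare against
    for iteration in range(2, max_iters + 1):
        current = generate(prompt)
        similarity = SequenceMatcher(None, previous, current).ratio()
        if similarity >= threshold:
            return iteration
        previous = current
    return None  # no pair of consecutive outputs reached the threshold
```

Returning `None` when the iteration budget is exhausted keeps non-convergence explicit, so a caller can fall back to another strategy instead of accepting an unstable output.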
