Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences

Shanshan Han, Salman Avestimehr, Chaoyang He

TL;DR

This work tackles safety risks in LLM inference by proposing a holistic guardrail pipeline, Wildflare GuardRail, that unifies detection, grounding, customization, and repair. It introduces four modules: Safety Detector (unsafe-input and hallucination detection with explanations), Grounding (vector-based contextualization via two indexing schemes), Customizer (real-time, wrapper-based output edits), and Repairer (hallucination correction guided by explanations). The approach builds on a lightweight Fox‑1 base model and specialized fine-tuned submodels, and reports competitive unsafe-content detection, URL screening in about 1.06s per query, and 80.7% accuracy in hallucination repair on standard datasets like HaluEval. The framework offers low-latency, configurable safety for latency-sensitive, high-stakes domains and provides a foundation for extending guardrails to emerging threats and multimodal safety scenarios.

Abstract

We present Wildflare GuardRail, a guardrail pipeline designed to enhance the safety and reliability of Large Language Model (LLM) inferences by systematically addressing risks across the entire processing workflow. Wildflare GuardRail integrates several core functional modules, including Safety Detector, which identifies unsafe inputs and detects hallucinations in model outputs while generating root-cause explanations; Grounding, which contextualizes user queries with information retrieved from vector databases; Customizer, which adjusts outputs in real time using lightweight, rule-based wrappers; and Repairer, which corrects erroneous LLM outputs using the hallucination explanations provided by Safety Detector. Results show that the unsafe content detection model in Safety Detector achieves performance comparable to the OpenAI API, even though it is trained on a small dataset constructed from several public datasets. Meanwhile, the lightweight wrappers can address malicious URLs in model outputs in 1.06s per query with 100% accuracy, without costly model calls. Moreover, the hallucination fixing model reduces hallucinations with an accuracy of 80.7%.
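
To make the idea of a lightweight, rule-based wrapper concrete, the following is a minimal Python sketch of how a Customizer-style wrapper might screen URLs in a model output without calling a model. It is an illustration under stated assumptions, not the paper's implementation: the names `make_url_wrapper`, `host_is_blocked`, `apply_wrappers`, the `BLOCKED_HOSTS` set, and the warning text are all hypothetical, and a static blocklist stands in for whatever URL check the actual system performs.

```python
import re
from typing import Callable, Iterable
from urllib.parse import urlparse

# Simple pattern for http(s) URLs; an assumption, not the paper's detector.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")

# Hypothetical blocklist standing in for a real malicious-URL check.
BLOCKED_HOSTS = {"malicious.example"}


def host_is_blocked(url: str) -> bool:
    """Return True if the URL's host is on the (hypothetical) blocklist."""
    host = urlparse(url).hostname or ""
    return host in BLOCKED_HOSTS


def make_url_wrapper(
    is_malicious: Callable[[str], bool],
    warning: str = "[warning: unsafe link removed]",
) -> Callable[[str], str]:
    """Build a wrapper that replaces flagged URLs in a model output."""

    def wrapper(output: str) -> str:
        def replace(match: re.Match) -> str:
            raw = match.group(0)
            # Keep sentence punctuation that trails the URL out of the check.
            url = raw.rstrip(".,;:!?")
            trailing = raw[len(url):]
            return (warning if is_malicious(url) else url) + trailing

        return URL_PATTERN.sub(replace, output)

    return wrapper


def apply_wrappers(output: str, wrappers: Iterable[Callable[[str], str]]) -> str:
    """Run each rule-based wrapper over the output in order."""
    for wrap in wrappers:
        output = wrap(output)
    return output


if __name__ == "__main__":
    text = "See https://malicious.example/login and https://docs.python.org/3/."
    print(apply_wrappers(text, [make_url_wrapper(host_is_blocked)]))
```

Because each wrapper is a plain function from string to string, several rules could in principle be chained in a fixed order, which is one way such edits can stay cheap relative to an additional model call.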

Paper Structure

This paper contains 11 sections, 1 equation, 6 figures, 3 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview.
  • Figure 2: Prompt templates and sample training data for hallucination detection and reasoning.
  • Figure 3: Prompt templates and sample training data for Repairer.
  • Figure 4: Safety detection.
  • Figure 5: Whole index.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Definition 1: Probability of hallucination
  • Definition 2: Callback
  • Example 1: Warning URLs
  • Example 2