Reducing hallucination in structured outputs via Retrieval-Augmented Generation

Patrice Béchard, Orlando Marquez Ayala

TL;DR

The paper tackles hallucination in Generative AI when producing structured outputs, such as workflow JSON generated from natural language requirements. It proposes a Retrieval-Augmented Generation pipeline that pairs a domain-specific retriever with a separately trained LLM, and shows that retrieved JSON objects steer generation toward fewer hallucinations while allowing smaller models to be used. Empirical results on internal enterprise data and out-of-domain splits show substantial reductions in hallucinated steps and tables, with a 7B-parameter LLM and a compact 110M-parameter retriever delivering strong performance at a deployable size. The work highlights practical engineering implications and suggests joint training and further efficiency improvements as future directions, with the aim of making enterprise-grade GenAI more reliable and scalable.
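The retrieve-then-generate flow summarized above can be sketched roughly as follows. This is only an illustration under stated assumptions: the step catalog, the step descriptions, and the prompt layout are hypothetical, and a toy bag-of-words cosine similarity stands in for the trained retriever encoder. The LLM call itself is omitted.

```python
import json
import math
from collections import Counter

# Hypothetical catalog of workflow steps; names and descriptions are
# illustrative only and are not taken from the paper's data.
STEP_CATALOG = [
    {"name": "send_notification", "description": "send a notification to a user"},
    {"name": "send_slack_message", "description": "post a message to a slack channel"},
    {"name": "create_record", "description": "create a record in a table"},
    {"name": "look_up_record", "description": "look up a record in a table"},
]


def bag_of_words(text):
    """Lowercased token counts; a stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, catalog, k=2):
    """Return the k catalog entries most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        catalog,
        key=lambda step: cosine(q, bag_of_words(step["description"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    """Prepend retrieved step names to the requirement (assumed prompt layout)."""
    suggestions = json.dumps([step["name"] for step in retrieved])
    return f"Requirement: {query}\nSuggested steps: {suggestions}\nWorkflow JSON:"


query = "post a message to the slack channel"
top = retrieve(query, STEP_CATALOG, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The point of the design is that the generator sees candidate step names that actually exist, so it has less reason to invent one.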

Abstract

A common and fundamental limitation of Generative AI (GenAI) is its propensity to hallucinate. Although large language models (LLMs) have taken the world by storm, real-world GenAI systems may face challenges in user adoption unless hallucinations are eliminated or at least reduced. In the process of deploying an enterprise application that produces workflows based on natural language requirements, we devised a system leveraging Retrieval-Augmented Generation (RAG) to greatly improve the quality of the structured output that represents such workflows. Thanks to our implementation of RAG, our proposed system significantly reduces hallucinations in the output and improves the generalization of our LLM in out-of-domain settings. In addition, we show that using a small, well-trained retriever encoder can reduce the size of the accompanying LLM, thereby making deployments of LLM-based systems less resource-intensive.

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Sample structured output (JSON) to generate given a natural language requirement.
  • Figure 2: High-level architecture diagram showing how the user query is used by both the retriever and the LLM to generate the structured JSON output.
  • Figure 3: Training example, where the last four lines are the expected output (in red). The underlined text comes from the retriever's output.
  • Figure 4: Examples where both the retriever and the LLM worked perfectly and where each of them failed: (a) All expected step names were suggested and used by the LLM. (b) The retriever did not suggest the step send_slack_message and therefore the LLM used the common step send_notification instead. (c) The LLM should have used the TRY step as the parent to all the steps, but it did not fully understand the user query.
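One way to make the notion of a hallucinated step concrete is to check generated step names against the set of steps that actually exist. The sketch below assumes a hypothetical JSON layout (a top-level "steps" list whose items carry a "name" field) and assumes that a step counts as hallucinated when its name is missing from the known set; neither the schema nor this exact definition is confirmed by the text above.

```python
import json


def hallucinated_steps(workflow_json, valid_step_names):
    """Return generated step names that are not in the known step set."""
    workflow = json.loads(workflow_json)
    names = [step["name"] for step in workflow.get("steps", [])]
    return [name for name in names if name not in valid_step_names]


def hallucination_rate(workflow_json, valid_step_names):
    """Fraction of generated steps whose names are unknown (0.0 if no steps)."""
    workflow = json.loads(workflow_json)
    total = len(workflow.get("steps", []))
    if total == 0:
        return 0.0
    return len(hallucinated_steps(workflow_json, valid_step_names)) / total


# Hypothetical generated output: the second step name does not exist.
generated = json.dumps(
    {"steps": [{"name": "look_up_record"}, {"name": "send_teams_message"}]}
)
valid = {"look_up_record", "send_notification", "send_slack_message"}

unknown = hallucinated_steps(generated, valid)
rate = hallucination_rate(generated, valid)
print(unknown, rate)
```

A check like this would not flag case (b) in Figure 4, where the LLM falls back to an existing but less specific step: that output is valid yet not what the user asked for, so it needs a comparison against the expected workflow instead.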