Suppressing Domain-Specific Hallucination in Construction LLMs: A Knowledge Graph Foundation for GraphRAG and QLoRA on River and Sediment Control Technical Standards

Takato Yasuno

Abstract

This paper addresses the challenge of answering technical questions derived from Japan's River and Sediment Control Technical Standards, a multi-volume regulatory document covering the survey, planning, design, and maintenance of river levees, dams, and sabo structures, using open-source large language models running entirely on local hardware. We implement and evaluate three complementary approaches: Case A (plain 20B LLM baseline), Case B (8B LLM with QLoRA domain fine-tuning on 715 graph-derived QA pairs), and Case C (20B LLM augmented with a Neo4j knowledge graph via GraphRAG). All three cases use the Swallow series of Japanese-adapted LLMs and are evaluated on a 100-question benchmark spanning 8 technical categories, scored automatically by an independent LLM judge (Qwen2.5-14B, score 0--3). The key finding is a performance inversion: the 8B QLoRA fine-tuned model (Case B) achieves a judge average of 2.92/3, surpassing both the 20B plain baseline (Case A: 2.29/3, a margin of $+0.63$) and the 20B GraphRAG approach (Case C: 2.62/3, a margin of $+0.30$), while responding about 3$\times$ faster than the baseline (14.2 s vs. 42.2 s for Case A). GraphRAG provides a moderate gain ($+0.33$ over the baseline) but is outperformed by domain-specific fine-tuning in both quality and efficiency. We document the full engineering pipeline: knowledge graph construction (200 nodes, 268 relations), QLoRA training data generation from Neo4j relations, training on a single GPU (16 GB VRAM) using unsloth, GGUF Q4_K_M quantisation and Ollama deployment, and the graph retrieval and re-ranking design. High-level engineering lessons are distilled in the main body; implementation pitfalls and toolchain details are documented in the Supplementary Materials.
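As a quick consistency check, the margins and the speed-up quoted in the abstract can be recomputed from the reported averages. The snippet below is only an arithmetic sketch; the dictionary layout and the helper name `margin` are illustrative and not part of the paper's code.

```python
# Judge averages (0-3 scale) and latencies as reported in the abstract.
scores = {"A": 2.29, "B": 2.92, "C": 2.62}
latency_s = {"A": 42.2, "B": 14.2}


def margin(better: str, worse: str) -> float:
    """Difference in judge average between two cases, rounded to 2 decimals."""
    return round(scores[better] - scores[worse], 2)


b_over_a = margin("B", "A")  # QLoRA vs. plain baseline
b_over_c = margin("B", "C")  # QLoRA vs. GraphRAG
c_over_a = margin("C", "A")  # GraphRAG vs. plain baseline
speedup = latency_s["A"] / latency_s["B"]  # baseline latency / QLoRA latency

print(b_over_a, b_over_c, c_over_a, round(speedup, 2))
```

The ratio 42.2 / 14.2 comes out slightly below 3, which is why the speed-up is best read as "about 3$\times$".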

Paper Structure (70 sections, 15 equations, 5 figures, 8 tables)

Introduction
Problem: Domain-Specific Hallucination.
Proposed Approach.
Related Work
Retrieval-Augmented Generation
Knowledge Graph Construction and Prompting
Parameter-Efficient Fine-Tuning
Japanese LLMs
LLM-as-Judge and Evaluation
Domain-Specific LLM Applications
Problem Formulation
Task.
Evaluation.
Test set.
Fairness.
...and 55 more sections

Figures (5)

Figure 1: Knowledge graph schema (Node & Relation Map). The Structural Hierarchy (left, orange) encodes four-level document structure. The Domain Semantics (right, blue) encodes engineering entities and their mutual relations. Dashed grey arrows cross-link domain nodes to structural locations (DESCRIBED_IN, DEFINED_IN). In total: 9 node types, 200 nodes, 11 relation types, 268 relations.
Figure 2: GraphRAG inference pipeline (Case C). At query time, keywords extracted from the user question drive five parallel Neo4j Cypher queries whose results are deduplicated and scored. If fewer than 25 hits are returned, an adaptive retry doubles TOP_K and broadens the match to substring search. The top-scoring 80% of records (up to 2,000 chars) are prepended to the plain-LLM prompt before generation by Swallow-20B. Qwen2.5-14B then evaluates the output and returns $s_{\mathrm{J}} \in \{0,1,2,3\}$.
Figure 3: Approach trade-off: inference speed vs. answer quality (normalised). Case B (QLoRA FT) occupies the ideal upper-right quadrant, combining the highest quality with the fastest inference.
Figure 4: Judge score distributions (0--3) for Cases A, B, and C across 100 questions.
Figure 5: Evolution of approaches: accuracy improvement across experimental phases. QLoRA FT (Case B) achieves the highest score at the final stage.
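
The retrieval flow described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the in-memory `make_query` stand-in replaces the Neo4j Cypher queries, the keyword-count `score` function is hypothetical, the initial `top_k=10` is an arbitrary choice, and the queries run sequentially rather than in parallel. It also assumes one reading of the caption, namely that the top-scoring 80% of records are kept and the joined context is then truncated to 2,000 characters.

```python
from typing import Callable, Dict, List

MIN_HITS = 25        # retry threshold from the Figure 2 caption
KEEP_FRACTION = 0.8  # share of top-scoring records kept
MAX_CHARS = 2000     # character budget for the prepended context

Record = Dict[str, str]
QueryFn = Callable[[List[str], int, bool], List[Record]]


def make_query(store: List[Record]) -> QueryFn:
    """Hypothetical in-memory stand-in for one Neo4j Cypher query."""
    def query(keywords: List[str], k: int, substring: bool) -> List[Record]:
        def match(rec: Record) -> bool:
            if substring:  # broadened match used on retry
                return any(kw in rec["text"] for kw in keywords)
            return any(kw in rec["text"].split() for kw in keywords)
        return [r for r in store if match(r)][:k]
    return query


def score(record: Record, keywords: List[str]) -> int:
    """Hypothetical relevance score: number of keywords present in the text."""
    return sum(kw in record["text"] for kw in keywords)


def retrieve(keywords: List[str], queries: List[QueryFn], top_k: int = 10) -> List[Record]:
    def run(k: int, substring: bool) -> List[Record]:
        merged: Dict[str, Record] = {}
        for q in queries:
            for rec in q(keywords, k, substring):
                merged.setdefault(rec["id"], rec)  # deduplicate by node id
        return list(merged.values())

    hits = run(top_k, substring=False)
    if len(hits) < MIN_HITS:  # adaptive retry: double TOP_K, substring search
        hits = run(top_k * 2, substring=True)
    hits.sort(key=lambda r: score(r, keywords), reverse=True)
    keep = max(1, int(len(hits) * KEEP_FRACTION)) if hits else 0
    return hits[:keep]


def build_prompt(question: str, records: List[Record]) -> str:
    """Prepend the retrieved context (truncated to MAX_CHARS) to the question."""
    context = "\n".join(r["text"] for r in records)[:MAX_CHARS]
    return f"{context}\n\n{question}" if context else question
```

In a real deployment each `QueryFn` would wrap a Cypher query issued through a Neo4j driver session, and the resulting prompt would be passed to the generator model; both steps are left out here because the caption does not specify them.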
