Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering

Derian Boer, Fabian Koch, Stefan Kramer

TL;DR

The paper addresses the tendency of LLMs to produce unreliable, hallucination-prone answers in domain-specific contexts by introducing 4StepFocus, a four-stage pipeline that uses a semi-structured knowledge base to supply traceable external context. An LLM is first prompted to extract triplets $(h_i, e_i, t_i)$ and a target variable from the question. A knowledge graph is then used to substitute the variables in those triplets, which yields a filtered candidate set $C_{\text{filtered}}$. The candidates are sorted by a vector similarity search over their associated unstructured data, and the best of them are finally reranked by the LLM with background data provided. This triplet-based prefiltering is designed to improve retrieval precision and interpretability compared with purely latent methods. Evaluations on the STaRK benchmarks across medical, product, and academic QA show gains in Hit@k, Recall@k, and MRR, particularly on relationally rich or text-heavy datasets. The paper also names directions for future work, such as integrating vector similarity into the substitution step and expanding reasoning capabilities. Overall, the work offers a practical, interpretable way to augment LLMs with external, traceable knowledge for evidence-based question answering.
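The ranking metrics named above can be illustrated with a minimal sketch. It uses the common textbook definitions of Hit@k, Recall@k, and reciprocal rank (MRR is the mean of the latter over all queries); the function names and the toy data are assumptions for illustration and are not taken from the paper or the STaRK code.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none is ranked."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Toy example with made-up item ids.
ranked = ["b", "a", "c", "d"]
relevant = {"a", "d"}
print(hit_at_k(ranked, relevant, 1))      # first item "b" is not relevant
print(recall_at_k(ranked, relevant, 2))   # one of two relevant items in top 2
print(reciprocal_rank(ranked, relevant))  # first relevant item is at rank 2
```

All three metrics depend only on the final ordering, so they can be applied unchanged to the output of any retrieval pipeline.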

Abstract

Large Language Models (LLMs) frequently lack domain-specific knowledge, and even fine-tuned models tend to hallucinate. Hence, more reliable models that can include external knowledge are needed. We present a pipeline, 4StepFocus, and specifically a preprocessing step, that can substantially improve the answers of LLMs. This is achieved by providing guided access to external knowledge, making use of the models' ability to capture relational context and conduct rudimentary reasoning by themselves. The method narrows down potentially correct answers by triplet-based searches in a semi-structured knowledge base in a direct, traceable fashion, before switching to latent representations for ranking those candidates based on unstructured data. This distinguishes it from related methods that are purely based on latent representations. 4StepFocus consists of four steps: 1) triplet generation for extraction of relational data by an LLM, 2) substitution of variables in those triplets to narrow down answer candidates employing a knowledge graph, 3) sorting the remaining candidates with a vector similarity search involving associated unstructured data, 4) reranking the best candidates by the LLM with background data provided. Experiments on a medical, a product recommendation, and an academic paper search test set demonstrate that this approach is indeed a powerful augmentation. It not only adds relevant, traceable background information from information retrieval, but also improves performance considerably in comparison to state-of-the-art methods. This paper presents a novel, largely unexplored direction and therefore provides a wide range of future work opportunities. The source code is available at https://github.com/kramerlab/4StepFocus.
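The four steps can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: the knowledge graph, the entity texts, and every function name are hypothetical. The LLM calls of steps 1 and 4 are replaced by a hard-coded triplet and a pass-through placeholder, and a bag-of-words cosine stands in for a learned embedding in step 3.

```python
from collections import Counter
from math import sqrt

# Toy knowledge graph of (head, edge, tail) facts. All data is made up.
KG = {
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "headache"),
    ("ibuprofen", "treats", "inflammation"),
    ("paracetamol", "treats", "fever"),
}

# Hypothetical unstructured text attached to each entity.
DOCS = {
    "aspirin": "blood thinner also used for pain relief",
    "ibuprofen": "anti inflammatory drug for pain and swelling",
    "paracetamol": "reduces fever and mild pain",
}


def is_var(term):
    """Variables are marked with a leading '?' in this sketch."""
    return term.startswith("?")


def substitute(triplets, target):
    """Step 2: keep entities that satisfy every triplet containing the target."""
    candidates = None
    for h, e, t in triplets:
        if h == target and not is_var(t):
            matches = {kh for kh, ke, kt in KG if ke == e and kt == t}
        elif t == target and not is_var(h):
            matches = {kt for kh, ke, kt in KG if ke == e and kh == h}
        else:
            continue  # triplets with several variables are out of scope here
        candidates = matches if candidates is None else candidates & matches
    # With no usable triplet, fall back to all entities (an assumption).
    return candidates if candidates is not None else set(DOCS)


def cosine(a, b):
    """Bag-of-words cosine similarity, a stand-in for an embedding model."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def vss_sort(question, candidates):
    """Step 3: sort candidates by similarity of their text to the question."""
    return sorted(candidates, key=lambda c: (-cosine(question, DOCS.get(c, "")), c))


def rerank(question, ranked, top_n=2):
    """Step 4 placeholder: a real system would let an LLM reorder the top_n."""
    return ranked[:top_n] + ranked[top_n:]


def four_step_sketch(question, triplets, target):
    filtered = substitute(triplets, target)   # C_filtered
    ranked = vss_sort(question, filtered)
    return rerank(question, ranked)


# Step 1 output is hard-coded; in the paper an LLM extracts it from the question.
question = "Which anti inflammatory drug treats headache?"
triplets = [("?x", "treats", "headache")]
result = four_step_sketch(question, triplets, "?x")
print(result)
```

In this toy run the triplet filter discards `paracetamol` before any similarity is computed, which is the traceable part of the pipeline: the surviving candidates can be explained by explicit graph facts, and only their ordering relies on latent similarity.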

Paper Structure (6 sections, 1 figure, 1 table, 3 algorithms)

Figures (1)

  • Figure 1: Pipeline of 4StepFocus that enhances VSS + LLM Reranker by triplet-based prefiltering steps.