KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering
Xinxin Zheng, Feihu Che, Jinyang Wu, Shuai Zhang, Shuai Nie, Kang Liu, Jianhua Tao
TL;DR
This work tackles hallucination in knowledge-intensive QA by using evidence documents more selectively. KS-LLM generates triples from the input question, selects the sentences in the evidence document that are most similar to these triples, and then generates the answer from the combination of triples and selected sentences. Fusing structured knowledge (triples) with textual evidence (sentences) improves results over the baselines on TriviaQA-verified, WebQuestions, and Natural Questions across multiple open-source LLMs. Ablations highlight the importance of limiting evidence length and of retrieving a small, fixed number of sentences. By avoiding full-document ingestion, the method reduces noise and shows practical gains in accuracy and efficiency for open-domain QA tasks that rely on external knowledge.
Abstract
Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks. A promising approach is to leverage evidence documents as extra supporting knowledge, which can be obtained through retrieval or generation. However, existing methods use the entire content of the evidence document directly, which may introduce noisy information and impair the performance of large language models. To tackle this problem, we propose a novel Knowledge Selection of Large Language Models (KS-LLM) method, which aims to identify valuable information in evidence documents. KS-LLM uses triples to select the knowledge snippets in an evidence document that are beneficial for answering the question. Specifically, we first generate triples based on the input question, then select the evidence sentences most similar to the triples from the evidence document, and finally combine the evidence sentences and triples to assist the large language model in generating the answer. Experimental comparisons on several question answering datasets, such as TriviaQA, WebQ, and NQ, demonstrate that the proposed method surpasses the baselines and achieves the best results.
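The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the triples are assumed to be given already (in KS-LLM they are generated by the LLM from the question), similarity is approximated with bag-of-words cosine similarity rather than whatever similarity measure the paper uses, and the names `select_evidence` and `build_prompt`, the prompt wording, and the default `k=2` are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def select_evidence(triples, document, k=2):
    """Return the k sentences most similar to the triples, best first.

    Assumption: bag-of-words cosine similarity stands in for the
    similarity measure used in the paper.
    """
    triple_vec = Counter(tokenize(" ".join(" ".join(t) for t in triples)))
    scored = [
        (cosine(triple_vec, Counter(tokenize(s))), i, s)
        for i, s in enumerate(split_sentences(document))
    ]
    # Sort by descending score; ties are broken by document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [s for _, _, s in scored[:k]]


def build_prompt(question, triples, evidence):
    """Combine triples and selected sentences into one prompt (hypothetical format)."""
    triple_text = "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)
    evidence_text = "\n".join(evidence)
    return (
        f"Triples:\n{triple_text}\n\n"
        f"Evidence:\n{evidence_text}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    question = "Where is the Eiffel Tower located?"
    # Hypothetical triple; KS-LLM would obtain this from the LLM.
    triples = [("Eiffel Tower", "located in", "Paris")]
    document = (
        "Paris is the capital of France. "
        "The Eiffel Tower is located in Paris. "
        "Bananas contain potassium."
    )
    evidence = select_evidence(triples, document, k=2)
    print(build_prompt(question, triples, evidence))
```

Keeping `k` small mirrors the ablation finding that a small, fixed number of retrieved sentences works better than passing the whole document to the model.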
