Towards Building a Robust Knowledge Intensive Question Answering Model with Large Language Models

Xingyun Hong; Yan Shao; Zhilin Wang; Manni Duan; Jin Xiongnan

TL;DR

The paper addresses the robustness of knowledge-intensive question answering with large language models when the retrieved data is noisy. It builds a dataset from machine reading comprehension (MRC) data with five interference types (SS, SSIncomp, MSCons, MSIncons, MSConf) and proposes a two-part approach: fine-tuning on data augmented by masking and swapping, and a contrastive learning objective that maximizes the gap between accepted and rejected outputs, formalized by the loss $L = -\sum_{k=1}^{N}\log\sigma\left(\frac{1}{C}\sum_{c=1}^{C}\log p(y_{C_c}|x) - \frac{1}{R}\sum_{r=1}^{R}\log p(y_{R_r}|x)\right)$, where $y_{C_c}$ and $y_{R_r}$ are the accepted and rejected outputs for input $x$. Experiments on GPT-3.5-Turbo, Baichuan2-13B-Chat, and other models show improved robustness and discrimination, with the fine-tuned Baichuan2-13B-Chat (BC2-13B-Chat) achieving significant gains and approaching the performance of strong baselines on WSCORE. These findings suggest practical deployment benefits for QA systems that face noisy external information.
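To make the contrastive objective concrete, the following is a minimal sketch that evaluates the loss above from precomputed sequence log-probabilities. It is not the authors' implementation. The function names `log_sigmoid` and `contrastive_loss` are hypothetical, and the sketch assumes that the outer sum over $k$ runs over $N$ training examples, each with its own lists of accepted and rejected log-probabilities.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigma(z)) = min(z, 0) - log(1 + exp(-|z|)).
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))


def contrastive_loss(examples):
    """Evaluate L = -sum_k log sigma(mean accepted log p - mean rejected log p).

    `examples` is a list of (accepted_logps, rejected_logps) pairs, one per
    training example k. Each element is a non-empty list of log p(y | x)
    values. This input layout is an assumption made for illustration.
    """
    total = 0.0
    for accepted_logps, rejected_logps in examples:
        mean_accepted = sum(accepted_logps) / len(accepted_logps)
        mean_rejected = sum(rejected_logps) / len(rejected_logps)
        # The margin grows as accepted outputs become more likely than rejected ones.
        margin = mean_accepted - mean_rejected
        total += -log_sigmoid(margin)
    return total


if __name__ == "__main__":
    # One example where accepted outputs are much more likely than the rejected one.
    well_separated = [([-0.5, -0.7], [-3.0])]
    # One example where the model cannot tell them apart (margin of zero).
    not_separated = [([-1.0, -1.0], [-1.0])]
    print(contrastive_loss(well_separated))
    print(contrastive_loss(not_separated))
```

When the mean log-probabilities of accepted and rejected outputs are equal, each example contributes $\log 2$ to the loss; the contribution shrinks toward zero as the accepted outputs become more likely than the rejected ones, which is the gap the objective is described as maximizing.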

Abstract

The development of LLMs has greatly enhanced the intelligence and fluency of question answering, while the emergence of retrieval enhancement has enabled models to make better use of external information. However, noise and errors in the retrieved information challenge the robustness of LLMs. In this work, to evaluate model performance under multiple kinds of interference, we first construct a dataset based on machine reading comprehension datasets that simulates various scenarios, including the absence of critical information, noise, and conflicts. To address the decline in model accuracy caused by noisy external information, we propose a data augmentation-based fine-tuning method that enhances the robustness of LLMs against noise. Additionally, a contrastive learning approach is used to preserve the model's ability to discriminate among pieces of external information. We conduct experiments on both existing LLMs and our approach, with the results evaluated by GPT-4. The results indicate that our proposed methods improve model robustness while strengthening the model's discrimination capability.

Paper Structure (18 sections, 1 equation, 3 figures, 4 tables)

Introduction
Related Work
Retrieval-Augmented LLMs
Robustness of LLMs
Dataset Construction
Methods
Data Augmentation
Contrastive Learning
Experiment
Models
Evaluation Metrics
Experimental Setup
Results
Ablation Study
Ablation Study of the First Stage Fine-tuning
...and 3 more sections

Figures (3)

Figure 1: An example of how external information impacts ChatGLM3-6B's performance.
Figure 2: Dataset construction: methods to create five kinds of samples.
Figure 3: Two-stage fine-tuning with the Baichuan2-13B-Chat model.
