Towards a Robust Retrieval-Based Summarization System

Shengjie Liu; Jing Wu; Jingyuan Bao; Wenyi Wang; Naira Hovakimyan; Christopher G Healey

TL;DR

This work addresses the robustness of retrieval-augmented summarization with large language models by introducing LogicSumm, a seven-scenario evaluation pipeline that stress-tests RAG-based summarization across diverse retrieval conditions. Building on LogicSumm, SummRAG provides an end-to-end framework for generating training dialogues and fine-tuning models, notably using LoRA to adapt Mistral-7B Instruct, aiming to achieve robust performance close to GPT-4 Turbo. Experimental results demonstrate improved logical coherence and summarization quality, particularly in multi-document settings where irrelevant content can derail succinct summaries. The authors release data, model weights, and code, proposing a structured, generalizable approach to strengthening LLM-based RAG systems beyond one-off fixes.

Abstract

This paper investigates the robustness of large language models (LLMs) for retrieval-augmented generation (RAG)-based summarization tasks. While LLMs provide summarization capabilities, their performance in complex, real-world scenarios remains under-explored. Our first contribution is LogicSumm, an evaluation framework incorporating realistic scenarios to assess LLM robustness during RAG-based summarization. Based on limitations identified by LogicSumm, we then developed SummRAG, a comprehensive system to create training dialogues and fine-tune a model to enhance robustness within LogicSumm's scenarios. SummRAG is an example of our goal of defining structured methods to test the capabilities of an LLM, rather than addressing issues in a one-off fashion. Experimental results confirm the effectiveness of SummRAG, showing improved logical coherence and summarization quality. Data, corresponding model weights, and Python code are available online.

Paper Structure (24 sections, 2 equations, 5 figures, 4 tables)

Introduction
Related Work
Large Language Model
Retrieval Augmented Generation
Text Summarization
LogicSumm
Implementation and Evaluation Metrics
SummRAG
Logical Special Tokens
Dialogue Generation: Top-1 Document
Dialogue Generation: Top-k Documents
Model Fine-Tuning
Connection to Prior Work
Experiments
Implementation and Evaluation Metrics
...and 9 more sections

Figures (5)

Figure 1: LogicSumm's pipeline, which divides evaluation into four aspects and seven scenarios
Figure 2: Illustration of limitations under LogicSumm
Figure 3: Dialogue generation for the top-1 document
Figure 4: Dialogue generation for the top-k documents
Figure 5: Example dialogue at each summarization step
