Towards a Robust Retrieval-Based Summarization System

Shengjie Liu, Jing Wu, Jingyuan Bao, Wenyi Wang, Naira Hovakimyan, Christopher G Healey

TL;DR

This work addresses the robustness of retrieval-augmented summarization with large language models by introducing LogicSumm, a seven-scenario evaluation pipeline that stress-tests RAG-based summarization across diverse retrieval conditions. Building on LogicSumm, SummRAG provides an end-to-end framework for generating training dialogues and fine-tuning models, notably using LoRA to adapt Mistral-7B Instruct, aiming to achieve robust performance close to GPT-4 Turbo. Experimental results demonstrate improved logical coherence and summarization quality, particularly in multi-document settings where irrelevant content can derail succinct summaries. The authors release data, model weights, and code, proposing a structured, generalizable approach to strengthening LLM-based RAG systems beyond one-off fixes.

Abstract

This paper describes an investigation of the robustness of large language models (LLMs) for retrieval-augmented generation (RAG)-based summarization tasks. While LLMs provide summarization capabilities, their performance in complex, real-world scenarios remains under-explored. Our first contribution is LogicSumm, an innovative evaluation framework incorporating realistic scenarios to assess LLM robustness during RAG-based summarization. Based on limitations identified by LogicSumm, we then developed SummRAG, a comprehensive system to create training dialogues and fine-tune a model to enhance robustness within LogicSumm's scenarios. SummRAG is an example of our goal of defining structured methods to test the capabilities of an LLM, rather than addressing issues in a one-off fashion. Experimental results confirm the power of SummRAG, showcasing improved logical coherence and summarization quality. Data, corresponding model weights, and Python code are available online.

Paper Structure (24 sections, 2 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: LogicSumm's pipeline, which divides evaluation into four aspects and seven scenarios
  • Figure 2: Illustration of limitations under LogicSumm
  • Figure 3: Dialogue generation for the top-1 document
  • Figure 4: Dialogue generation for the top-$k$ documents
  • Figure 5: Example dialogue at each summarization step