Towards a Robust Retrieval-Based Summarization System
Shengjie Liu, Jing Wu, Jingyuan Bao, Wenyi Wang, Naira Hovakimyan, Christopher G Healey
TL;DR
This work addresses the robustness of retrieval-augmented summarization with large language models by introducing LogicSumm, a seven-scenario evaluation pipeline that stress-tests RAG-based summarization across diverse retrieval conditions. Building on LogicSumm, SummRAG provides an end-to-end framework for generating training dialogues and fine-tuning models, notably using LoRA to adapt Mistral-7B Instruct, aiming to achieve robust performance close to GPT-4 Turbo. Experimental results demonstrate improved logical coherence and summarization quality, particularly in multi-document settings where irrelevant content can derail succinct summaries. The authors release data, model weights, and code, proposing a structured, generalizable approach to strengthening LLM-based RAG systems beyond one-off fixes.
Abstract
This paper investigates the robustness of large language models (LLMs) in retrieval-augmented generation (RAG)-based summarization tasks. While LLMs provide summarization capabilities, their performance in complex, real-world scenarios remains under-explored. Our first contribution is LogicSumm, an evaluation framework that incorporates realistic scenarios to assess LLM robustness during RAG-based summarization. Based on limitations identified by LogicSumm, we then developed SummRAG, a comprehensive system that creates training dialogues and fine-tunes a model to enhance robustness within LogicSumm's scenarios. SummRAG is an example of our goal of defining structured methods to test the capabilities of an LLM, rather than addressing issues in a one-off fashion. Experimental results confirm the effectiveness of SummRAG, showing improved logical coherence and summarization quality. Data, corresponding model weights, and Python code are available online.
