Fine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data
Swati Rallapalli, Shannon Gallagher, Andrew O. Mellinger, Jasmine Ratchford, Anusha Sinha, Tyler Brooks, William R. Nichols, Nick Winski, Bryan Brown
TL;DR
This work tackles on-premise fine-tuning of LLMs for summarizing government-like reports when ground-truth summaries are scarce and compute is limited. It compares Knowledge Fine-tuning (KFT) and Format Fine-tuning (FFT), using LoRA and DeepSpeed on Llama 7B and T5 Small, across National Archives, Kaggle News Summary, and Newsroom data. The evaluation framework combines ROUGE, BLEU, METEOR, and BERTScore with Topic Similarity (TS) via LDA, AlignScore, and targeted human inspection. Key findings show that KFT reduces the number of invalid summaries but may not always surpass foundation models on valid content, while FFT yields consistent improvements on the news datasets; TS and AlignScore provide complementary insights into topical and factual quality, respectively. The study demonstrates the practical feasibility of on-premise fine-tuning under resource constraints and offers a structured evaluation approach for scenarios with scarce ground-truth data, with plans to release data and models for research use.
Abstract
We study the efficacy of fine-tuning Large Language Models (LLMs) for the specific task of summarizing reports (government archives, news, intelligence reports). Although this topic is being actively researched, our specific application setup faces two challenges: (i) ground-truth summaries may be unavailable (e.g., for government archives), and (ii) compute power is limited: the sensitive nature of the application requires that computation be performed on-premise, and for most of our experiments we use one or two A100 GPU cards. Under this setup we conduct experiments to answer the following questions. First, given that fine-tuning LLMs can be resource intensive, is it feasible to fine-tune them on-premise for improved report summarization capabilities? Second, what metrics could we leverage to assess the quality of these summaries? We conduct experiments on two different fine-tuning approaches in parallel, and our findings reveal interesting trends regarding the utility of fine-tuning LLMs. Specifically, we find that in many cases fine-tuning helps improve summary quality, and in other cases it helps by reducing the number of invalid or garbage summaries.
