A Guide To Effectively Leveraging LLMs for Low-Resource Text Summarization: Data Augmentation and Semi-supervised Approaches
Gaurav Sahu, Olga Vechtomova, Issam H. Laradji
TL;DR
This work targets low-resource text summarization by introducing MixSumm and PPSL, two strategies that use open LLMs to augment data and generate high-quality pseudo-labels. MixSumm synthesizes diverse documents by mixing topical information from different clusters of a small seed set and then produces extractive and abstractive summaries for them, while PPSL uses a teacher model to generate pseudo-labels that LLMs refine and rate over multiple stages. With only about 5% of the labeled data, both methods reach competitive ROUGE and L-Eval scores on TweetSumm, WikiHow, and ArXiv/PubMed, and they support effective knowledge distillation from LLaMA-3-70b-Instruct to smaller models such as BERT-base and DistilBART. By combining data augmentation and semi-supervised learning with open-source LLMs, the paper makes low-resource summarization more practical and scalable. It also points to future work on multilingual and long-document summarization and draws attention to the ethics of synthetic data generation.
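To make the pseudo-labeling idea concrete, here is a minimal sketch of a generate, refine, rate, and select loop. It is not the authors' implementation: the function `select_pseudo_labels`, the callables `teacher`, `refine`, and `rate`, the ordering of the stages, and the `top_k` selection rule are all assumptions made for illustration. In practice the teacher would be a trained summarizer and the other two callables would wrap prompts to an LLM.

```python
from typing import Callable, List, Tuple


def select_pseudo_labels(
    unlabeled_docs: List[str],
    teacher: Callable[[str], str],       # hypothetical: document -> draft summary
    refine: Callable[[str, str], str],   # hypothetical: (document, draft) -> refined summary
    rate: Callable[[str, str], float],   # hypothetical: (document, summary) -> quality score
    top_k: int,                          # assumed selection rule: keep the k best-rated pairs
) -> List[Tuple[str, str]]:
    """Return the top_k (document, pseudo-summary) pairs, best-rated first."""
    scored = []
    for doc in unlabeled_docs:
        draft = teacher(doc)          # teacher model proposes a pseudo-label
        refined = refine(doc, draft)  # an LLM would rewrite the draft here
        score = rate(doc, refined)    # an LLM would score the refined summary here
        scored.append((score, doc, refined))

    # Highest-rated pseudo-labels first; ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(doc, summary) for _, doc, summary in scored[:top_k]]
```

The selected pairs would then be added to the labeled set before the next training round. Passing the models in as plain callables keeps the sketch independent of any particular LLM API.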
Abstract
Existing approaches for low-resource text summarization primarily employ large language models (LLMs) like GPT-3 or GPT-4 at inference time to generate summaries directly; however, such approaches often suffer from inconsistent LLM outputs and are difficult to adapt to domain-specific data in low-resource scenarios. In this work, we propose two novel methods to effectively utilize LLMs for low-resource text summarization: 1) MixSumm, an LLM-based data augmentation regime that synthesizes high-quality documents (short and long) for few-shot text summarization, and 2) PPSL, a prompt-based pseudo-labeling strategy for sample-efficient semi-supervised text summarization. Specifically, MixSumm leverages the open-source LLaMA-3-70b-Instruct model to generate new documents by mixing topical information derived from a small seed set, and PPSL leverages the same model to generate high-quality pseudo-labels in a semi-supervised learning setup. We evaluate our methods on the TweetSumm, WikiHow, and ArXiv/PubMed datasets and use L-Eval, a LLaMA-3-based evaluation metric, and ROUGE scores to measure the quality of generated summaries. Our experiments on extractive and abstractive summarization show that MixSumm and PPSL achieve ROUGE scores competitive with a fully supervised method while using only 5% of the labeled data.
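For readers unfamiliar with ROUGE, the sketch below computes a simplified ROUGE-1 F1 score, that is, unigram overlap between a reference summary and a generated one. It is illustrative only: the function name `rouge1_f1` is ours, tokenization is plain lowercased whitespace splitting, and no stemming is applied, so its values will generally differ from those of the ROUGE implementation used in the paper.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())  # share of candidate words matched
    recall = overlap / sum(ref_counts.values())      # share of reference words matched
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

In the example call, five of the six words in each text match, so precision, recall, and F1 all equal 5/6.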
