A Guide To Effectively Leveraging LLMs for Low-Resource Text Summarization: Data Augmentation and Semi-supervised Approaches

Gaurav Sahu, Olga Vechtomova, Issam H. Laradji

TL;DR

This work targets low-resource text summarization by introducing MixSumm and PPSL, two strategies that leverage open LLMs to augment data and generate high-quality pseudo-labels. MixSumm synthesizes diverse documents by mixing topics from different topical clusters and then produces extractive and abstractive summaries, while PPSL uses a teacher model to generate pseudo-labels that are refined and rated by LLMs in multiple stages. The approach demonstrates competitive ROUGE and L-Eval performance with only about 5% of labeled data across TweetSumm, WikiHow, and ArXiv/PubMed, and it supports effective knowledge distillation from LLaMA-3-70b-Instruct to smaller models like BERT-base and DistilBART. By combining data augmentation and semi-supervised learning with open-source LLMs, the paper advances practical, scalable low-resource summarization and suggests directions for broader multi-language and long-document handling, with attention to ethical aspects of synthetic data generation.

Abstract

Existing approaches for low-resource text summarization primarily employ large language models (LLMs) like GPT-3 or GPT-4 at inference time to generate summaries directly; however, such approaches often suffer from inconsistent LLM outputs and are difficult to adapt to domain-specific data in low-resource scenarios. In this work, we propose two novel methods to effectively utilize LLMs for low-resource text summarization: 1) MixSumm, an LLM-based data augmentation regime that synthesizes high-quality documents (short and long) for few-shot text summarization, and 2) PPSL, a prompt-based pseudo-labeling strategy for sample-efficient semi-supervised text summarization. Specifically, MixSumm leverages the open-source LLaMA-3-70b-Instruct model to generate new documents by mixing topical information derived from a small seed set, and PPSL leverages the same model to generate high-quality pseudo-labels in a semi-supervised learning setup. We evaluate our methods on the TweetSumm, WikiHow, and ArXiv/PubMed datasets and use L-Eval, a LLaMA-3-based evaluation metric, and ROUGE scores to measure the quality of generated summaries. Our experiments on extractive and abstractive summarization show that MixSumm and PPSL achieve ROUGE scores competitive with a fully supervised method while using only 5% of the labeled data.
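
As a rough illustration of the topic-mixing idea behind MixSumm, the sketch below builds a generation prompt from seed documents drawn from distinct topic groups. It assumes the group assignments already exist (the paper's Figure 2 caption says they come from $k$-means). The function name `build_mix_prompt`, its parameters, and the prompt wording are all hypothetical and are not taken from the paper.

```python
import random
from typing import Dict, List


def build_mix_prompt(groups: Dict[int, List[str]],
                     n_groups: int = 2,
                     docs_per_group: int = 1,
                     seed: int = 0) -> str:
    """Build a mixing prompt from documents drawn from distinct groups.

    `groups` maps a group id (e.g. a k-means cluster id) to its seed documents.
    The prompt text is a hypothetical stand-in, not the paper's actual prompt.
    """
    if n_groups > len(groups):
        raise ValueError("n_groups exceeds the number of available groups")
    rng = random.Random(seed)
    # Pick distinct groups so the examples cover more than one topic.
    chosen = rng.sample(sorted(groups), n_groups)
    parts = []
    for gid in chosen:
        k = min(docs_per_group, len(groups[gid]))
        for doc in rng.sample(groups[gid], k):
            parts.append(f"[Group {gid}] {doc}")
    examples = "\n".join(parts)
    return (
        "Here are example documents from different topic groups:\n"
        f"{examples}\n"
        "Write one new document that mixes information from these topics."
    )
```

The returned string would then be sent to the LLM (LLaMA-3-70b-Instruct in the paper); that call is left out here because it depends on the serving setup.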

Paper Structure (43 sections, 2 equations, 6 figures, 7 tables)

This paper contains 43 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: L-Eval scores of different methods on low-resource extractive text summarization. The proposed MixSumm approach generates new documents by combining topics from multiple examples and outperforms other methods, including a strong LLM-based DA method (MixSumm w/o mixup) and a prompt-based semi-supervised approach (PPSL).
  • Figure 2: MixSumm pipeline. We first group the documents into $T$ groups using the $k$-means algorithm. Then, we construct the prompt for LLaMA-3-70b-Instruct by including documents from different groups and instructing the LLM to mix information from multiple topics when generating the new documents. Finally, we train a PreSumm extractive summarizer (Liu and Lapata, 2019) on the combined seed and synthesized datasets. For abstractive summarization, we add a DistilBART model after PreSumm.
  • Figure 3: PPSL pipeline. Step 1: train a teacher model $M$ on the limited labeled dataset. Step 2: generate pseudo-labels for the unlabeled set with $M$ and shortlist 50 based on teacher confidence (see Equation \ref{eq:conf}). Step 3: prompt an LLM to summarize the shortlisted documents. Step 4: score the pseudo-labels from Step 3 by prompting an LLM and select the top 5. These summaries are then added to the training data for the next cycle.
  • Figure 4: ROUGE-1 curves vs. number of cycles for the data-scarce setting. Each cycle denotes an addition of 5 new pseudo-labels to the training set. All results use BERT$_{base}$ as the backbone for PreSumm. The curves are averaged over three seeds (the band width denotes the standard deviation). Note that we report the GPT-4 version of our method here.
  • Figure 5: Quality of pseudo-labels by different strategies (data-scarce setup). The y-axis denotes the ROUGE-2 scores of the top 5 pseudo-labels computed against the respective ground truths. All results are for BERT$_{base}$ as the backbone for PreSumm and three random seeds. Refer to Section \ref{sec:pseudo-quality} for complete details.
  • ...and 1 more figure
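
The selection logic in the PPSL caption (Figure 3, Steps 2 to 4) can be sketched as one cycle of shortlist, summarize, score, and select. This is a minimal sketch under stated assumptions: `teacher_confidence`, `llm_summarize`, and `llm_score` are hypothetical callables standing in for the teacher model $M$ and the LLM prompts, and the function name `ppsl_cycle` is not from the paper. Only the shortlist size (50) and the number of selected pseudo-labels (5) come from the caption.

```python
from typing import Callable, List, Tuple


def ppsl_cycle(unlabeled: List[str],
               teacher_confidence: Callable[[str], float],
               llm_summarize: Callable[[str], str],
               llm_score: Callable[[str, str], float],
               shortlist_size: int = 50,
               top_k: int = 5) -> List[Tuple[str, str]]:
    """Run one pseudo-labeling cycle and return (document, summary) pairs.

    All three callables are hypothetical stand-ins: the first wraps the
    teacher model's confidence, the other two wrap LLM prompts.
    """
    # Step 2: keep the documents the teacher is most confident about.
    shortlisted = sorted(unlabeled, key=teacher_confidence,
                         reverse=True)[:shortlist_size]
    # Step 3: ask the LLM to summarize each shortlisted document.
    candidates = [(doc, llm_summarize(doc)) for doc in shortlisted]
    # Step 4: ask the LLM to rate each summary and keep the best ones.
    ranked = sorted(candidates,
                    key=lambda pair: llm_score(pair[0], pair[1]),
                    reverse=True)
    return ranked[:top_k]
```

The returned pairs would be appended to the labeled set before retraining the teacher for the next cycle (Step 1), which is omitted here.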