
Prompt-based Pseudo-labeling Strategy for Sample-Efficient Semi-Supervised Extractive Summarization

Gaurav Sahu, Olga Vechtomova, Issam H. Laradji

TL;DR

The paper tackles semi-supervised extractive summarization under limited labeled data by introducing a four-stage pipeline that leverages a PreSumm teacher, prompt-based relabeling, and prompt-based scoring LLMs to curate high-quality pseudo-labels across cycles. It demonstrates that this approach yields ROUGE improvements and competitive L-Eval results on TweetSumm, WikiHow, and ArXiv/PubMed, while using far fewer labeled examples than fully supervised models. The key contributions are the prompt-based pseudo-label selection strategy, the LLM-driven relabeling mechanism, and empirical evidence of improved sample efficiency and cross-domain robustness. In data-rich settings, the method can even outperform fully supervised baselines, highlighting its practical potential for scalable extractive summarization.

Abstract

Semi-supervised learning (SSL) is a widely used technique in scenarios where labeled data is scarce and unlabeled data is abundant. While SSL is popular for image and text classification, it is relatively underexplored for the task of extractive text summarization. Standard SSL methods follow a teacher-student paradigm: they first train a classification model and then use the classifier's confidence values to select pseudo-labels for the subsequent training cycle. However, such classifiers are not suited to measuring the accuracy of pseudo-labels because they lack specific tuning for evaluation, which leads to confidence values that fail to capture the semantics and correctness of the generated summary. To address this problem, we propose a prompt-based pseudo-labeling strategy with LLMs that picks unlabeled examples with more accurate pseudo-labels than using just the classifier's probability outputs. Our approach also includes a relabeling mechanism that improves the quality of pseudo-labels. We evaluate our method on three text summarization datasets: TweetSumm, WikiHow, and ArXiv/PubMed. We empirically show that a prompting-based LLM that scores and generates pseudo-labels outperforms existing SSL methods on ROUGE-1, ROUGE-2, and ROUGE-L scores on all the datasets. Furthermore, our method achieves L-Eval scores (evaluation with LLaMa-3) competitive with a fully supervised method in a data-scarce setting and outperforms the fully supervised method in a data-abundant setting.

Paper Structure (16 sections, 1 equation, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: L-Eval scores of semi-supervised vs. fully supervised models in a low-data setting. The proposed semi-supervised method (middle column for each dataset) combines PreSumm's confidence with GPT-4's knowledge to generate better pseudo-labels. It performs competitively on the three summarization datasets while using 6$\times$ fewer labels than a fully supervised method (rightmost column for each dataset).
  • Figure 2: Proposed method. Our approach has four main stages. Stage 1: train a teacher model $M$ (PreSumm) on the limited labeled dataset. Stage 2: generate pseudo-labels for the unlabeled set with $M$ and shortlist 50 based on teacher confidence (see Equation \ref{eq:conf}). Stage 3: prompt an off-the-shelf LLM like GPT-4 to generate an extractive summary of each shortlisted document from Stage 2. Stage 4: score the pseudo-labels from Stage 3 by prompting an LLM and select the top 5. These summaries are then added to the training data for the next iteration.
  • Figure 3: Different prompts used in the experiments.
  • Figure 4: Quality of pseudo-labels by different strategies (data-scarce setup). The y-axis denotes the ROUGE-2 scores of the top 5 pseudo-labels computed against the respective ground truths. All results use BERT$_{base}$ as the backbone for PreSumm and are computed over three random seeds. Refer to Section \ref{sec:pseudo-quality} for complete details.
  • Figure 5: ROUGE-1 curves vs. number of cycles for the data-scarce setting. Each cycle denotes an addition of 5 new pseudo-labels to the training set. All results use BERT$_{base}$ as the backbone for PreSumm. The curves are averaged over three seeds (the band width denotes the standard deviation). Note that we report the GPT-4 version of our method here.
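
The selection logic described in the Figure 2 caption (Stages 2 to 4) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `run_cycle` and the three callables `teacher_confidence`, `llm_relabel`, and `llm_score` are hypothetical placeholders for the PreSumm confidence measure, the LLM relabeling prompt, and the LLM scoring prompt, none of which are specified in this excerpt. Stage 1 (training the teacher) is assumed to have happened already. Only the shortlist size (50) and the number of selected pseudo-labels per cycle (5) come from the caption.

```python
from typing import Callable, List, Tuple


def run_cycle(
    unlabeled: List[str],
    teacher_confidence: Callable[[str], float],  # hypothetical: teacher's confidence for a document
    llm_relabel: Callable[[str], str],           # hypothetical: LLM prompt returning an extractive summary
    llm_score: Callable[[str, str], float],      # hypothetical: LLM prompt rating (document, summary)
    shortlist_size: int = 50,                    # from the Figure 2 caption
    top_k: int = 5,                              # from the Figure 2 caption
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Run Stages 2-4 once; return (selected pseudo-labeled pairs, remaining unlabeled docs)."""
    # Stage 2: rank unlabeled documents by teacher confidence and keep a shortlist.
    ranked = sorted(unlabeled, key=teacher_confidence, reverse=True)
    shortlist = ranked[:shortlist_size]

    # Stage 3: ask the LLM for an extractive summary of each shortlisted document.
    relabeled = [(doc, llm_relabel(doc)) for doc in shortlist]

    # Stage 4: score each (document, summary) pair with the LLM and keep the best top_k.
    scored = sorted(relabeled, key=lambda pair: llm_score(pair[0], pair[1]), reverse=True)
    selected = scored[:top_k]

    # Selected documents leave the unlabeled pool; the pairs join the training set.
    chosen = {doc for doc, _ in selected}
    remaining = [doc for doc in unlabeled if doc not in chosen]
    return selected, remaining
```

In a full training loop, the returned pairs would be appended to the labeled set, the teacher retrained, and `run_cycle` called again on the remaining pool. Passing the scoring and relabeling steps in as callables keeps the sketch independent of any particular LLM API.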