Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM

Ruohong Zhang, Yau-Shian Wang, Yiming Yang

TL;DR

GenCo (Generation-driven Contrastive Self-training) tackles zero-shot text classification by placing an instruction-following LLM inside the self-training loop of a smaller encoder. The LLM enriches each input with generated continuations and rewrites inputs conditioned on predicted labels, which yields higher-quality pseudo-labels and training pairs. These are trained with a contrastive loss that combines soft labeling and entropy regularization. Across four benchmark datasets, GenCo surpasses strong self-training baselines, and even Alpaca-7B with human prompts, when only limited in-domain unlabeled text is available, while remaining computationally efficient. The results illustrate the practical value of integrating generative LLMs into iterative, data-efficient self-training pipelines for domain-adaptive text classification.

Abstract

The remarkable performance of large language models (LLMs) in zero-shot language understanding has garnered significant attention. However, employing LLMs for large-scale inference or domain-specific fine-tuning requires immense computational resources due to their substantial model size. To overcome these limitations, we introduce a novel method, GenCo, which leverages the strong generative power of LLMs to assist in training a smaller and more adaptable language model. In our method, an LLM contributes to the self-training loop of a smaller model in two ways. First, the LLM is used to augment each input instance with a variety of possible continuations, enriching its semantic context for better understanding. Second, it helps craft additional high-quality training pairs by rewriting input texts conditioned on predicted labels. This ensures the generated texts are highly relevant to the predicted labels, alleviating prediction errors during pseudo-labeling while reducing the dependency on large volumes of unlabeled text. In our experiments, GenCo outperforms previous state-of-the-art methods when only limited ($<5\%$ of original) in-domain text data is available. Notably, our approach surpasses the performance of Alpaca-7B with human prompts, highlighting the potential of leveraging LLMs for self-training.

Paper Structure

This paper contains 24 sections, 3 theorems, 23 equations, 4 figures, and 6 tables.

Key Result

Theorem 1

Consider a binary classification problem with linearly separable labeled examples. When $0<\tau<1$, optimizing the loss in eq:loss with gradient descent enforces a larger margin between classes and achieves a max-margin classifier under certain constraints.
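
The role of the temperature can be illustrated numerically. The loss referred to as eq:loss is not reproduced on this page, so the sketch below only assumes that soft targets are formed by a temperature-scaled softmax over class scores; the function names and the example scores are illustrative and not taken from the paper. Under that assumption, choosing $0<\tau<1$ sharpens the distribution toward the top-scoring class and lowers its entropy, which is the intuition behind pushing the classes further apart.

```python
import math


def softmax_with_temperature(scores, tau):
    """Return softmax(scores / tau) as a list of probabilities."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative class scores for one example (assumed values).
scores = [2.0, 1.0, 0.5]

plain = softmax_with_temperature(scores, tau=1.0)  # unsharpened prediction
sharp = softmax_with_temperature(scores, tau=0.5)  # sharpened soft target

print("tau=1.0:", [round(p, 3) for p in plain], "entropy:", round(entropy(plain), 3))
print("tau=0.5:", [round(p, 3) for p in sharp], "entropy:", round(entropy(sharp), 3))
```

Training a model toward the sharpened target rather than its own unsharpened prediction rewards confident, well-separated outputs. The theorem's formal statement concerns the paper's actual loss, which this toy example does not implement.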

Figures (4)

  • Figure 1: Enriching textual semantics through LLM Generation: The input text and an instruction are fed into the LLM to generate multiple pieces of elaborated texts, each of which is concatenated to the original input to obtain an augmented text. The embeddings of the augmented texts are then averaged to obtain a merged embedding, which is used for label prediction and contrastive loss in the self-training process.
  • Figure 2: Conditional text augmentation to address mislabeling in self-training: When a pseudo label is incorrect, it can mislead the training process and decrease classification performance. We generate augmented text conditioned on the pseudo label, aiming to make the generated text closer to the majority members in the category of the pseudo label. This approach aims to improve the quality of the generated instances for self-training.
  • Figure 3: Per-class F1 (upper) and ranking-based precision (lower) for classification performance with input augmentation.
  • Figure 4: The left figure shows a heatmap of the probability that a text generated conditionally on a pseudo label aligns with each of the label prompts. The right figure shows the distribution of the generated text plotted using t-SNE (sports category is out of scope).
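
The input-augmentation step described for Figure 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for the sentence encoder, the `generations` list is hard-coded in place of real LLM output, and scoring by cosine similarity against label prompts is an assumed choice for the prediction step.

```python
import math


def toy_embed(text, dim=16):
    """Hypothetical stand-in for a sentence encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Deterministic bucket per token (Python's built-in hash() is salted per run).
        vec[sum(ord(c) for c in token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def merged_embedding(text, generations, embed=toy_embed):
    """Concatenate each generation to the input, embed each result, and average."""
    if not generations:
        return embed(text)  # no augmentation available: fall back to the raw input
    augmented = [f"{text} {g}" for g in generations]
    vectors = [embed(a) for a in augmented]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict(text, generations, label_prompts, embed=toy_embed):
    """Score the merged embedding against each label prompt and return the best label."""
    merged = merged_embedding(text, generations, embed)
    scores = {label: cosine(merged, embed(prompt)) for label, prompt in label_prompts.items()}
    return max(scores, key=scores.get), scores


# Assumed example data; in practice the generations would come from the LLM.
text = "The team won the final"
generations = [
    "after scoring twice in the second half of the match",
    "and the players celebrated with the fans at the stadium",
]
label_prompts = {
    "sports": "a news article about sports match team players",
    "business": "a news article about business market company stocks",
}

label, scores = predict(text, generations, label_prompts)
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Averaging over several augmented views is what makes the merged embedding less sensitive to any single generation. Swapping `toy_embed` for a real encoder changes only the `embed` argument.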

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 2
  • proof
  • Theorem 3
  • proof