FuseGen: PLM Fusion for Data-generation based Zero-shot Learning

Tianyuan Zou, Yang Liu, Peng Li, Jianqing Zhang, Jingjing Liu, Ya-Qin Zhang

TL;DR

FuseGen tackles data-quality bias in synthetic datasets used for data-generation-based zero-shot learning by fusing multiple PLMs. It introduces Cross-model Dataset Generation (CDG) to harvest cross-PLM feedback and Cross-model Data Quality Improvement (CDI) with Self-Boosting Weight Adjustment (SWA) to emphasize high-quality samples, avoiding PLM fine-tuning. Across eight NLI/NLU tasks and both open-source and closed-source PLMs, FuseGen consistently outperforms single-PLM baselines, including ProGen and SunGen, while remaining PLM-agnostic. The approach reduces reliance on a single PLM and demonstrates practical impact for resource-constrained settings by improving STM performance with a PLM cluster and efficient weighting. This yields a flexible, query-efficient framework for data-generation-based zero-shot learning in diverse downstream tasks.

Abstract

Data generation-based zero-shot learning, although effective in training Small Task-specific Models (STMs) via synthetic datasets generated by Pre-trained Language Models (PLMs), is often limited by the low quality of such synthetic datasets. Previous solutions have primarily focused on single-PLM settings, where synthetic datasets are typically restricted to specific sub-spaces and often deviate from real-world distributions, leading to severe distribution bias. To mitigate such bias, we propose FuseGen, a novel data generation-based zero-shot learning framework that introduces a new criterion for subset selection from synthetic datasets by utilizing multiple PLMs and trained STMs. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality through iterative data generation. Trained STMs are also used for sample re-weighting, further improving data quality. Extensive experiments across diverse tasks demonstrate that FuseGen substantially outperforms existing methods and is highly effective in boosting STM performance in a PLM-agnostic way. Code is available at https://github.com/LindaLydia/FuseGen.
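
To make "sample re-weighting" concrete, the sketch below shows a generic weighted cross-entropy, in which each synthetic sample's loss is scaled by a per-sample weight before averaging. It is only an illustration of the general idea. The abstract does not specify how FuseGen computes its weights, so the function name `weighted_cross_entropy`, its arguments, and the example weights are all assumptions rather than the authors' implementation.

```python
import math


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood over a batch of samples.

    Hypothetical helper for illustration only (not from the FuseGen code).
    probs[i][c] -- predicted probability of class c for sample i
    labels[i]   -- integer class label attached to synthetic sample i
    weights[i]  -- non-negative importance weight of sample i
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        # Each sample contributes its negative log-likelihood scaled by its weight.
        loss += w * -math.log(p[y])
    return loss / total_weight


# Two synthetic samples; the second is assumed to be low quality.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 0]
uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0])
reweighted = weighted_cross_entropy(probs, labels, [1.0, 0.0])
```

With uniform weights both samples contribute equally. Setting the second weight to zero removes that sample's influence, so the loss is determined by the first sample alone.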

Paper Structure

This paper contains 28 sections, 5 equations, 8 figures, 12 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Synthetic dataset cartography (Swayamdipta et al., 2020) using $1,000$ samples generated by Llama-2 and Flan-T5 for movie review sentiment analysis. ZeroGen (Ye et al., 2022) uses a zero-shot prompt for generation, while ProGen (Ye et al., 2022) and FuseGen (Ours) use a few-shot prompt with feedback, with ProGen relying on a single PLM and FuseGen leveraging multiple PLMs. $K$ is the number of PLMs. Numbers within parentheses are the results of the STM trained with Self-boosting Weight Adjustment (see the Cross-model Data Quality Improvement section) and evaluated on the IMDb dataset (Maas et al., 2011). Results for more PLMs are provided in the dataset cartography section of the appendix.
  • Figure 2: Performance of STM trained using $6,000$ synthetic data samples generated by various PLMs. "mixed" uses a dataset comprising $6,000$ total samples given by the $6$ listed PLMs ($1,000$ samples per PLM). "FuseGen" (Ours) uses the $6$ listed PLMs and $6,000$ samples.
  • Figure 3: Illustrated workflow of FuseGen, which has two components: Cross-model Dataset Generation (CDG) and Cross-model Data Quality Improvement (CDI). CDG iteratively executes parallel synthetic data generation, cross-model data quality evaluation, and cross-PLM in-context learning. CDI implements self-boosting weight adjustment for sample-reweighted training of the STM.
  • Figure 4: Comparison of FuseGen using multiple PLMs (last bar) versus a single PLM on the QNLI dataset.
  • Figure 5: Ablation results for different FuseGen hyper-parameters, with QNLI as the test dataset.
  • ...and 3 more figures