Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples
Soma Sato, Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda
TL;DR
This work reduces the reliance on large manually annotated NLI datasets for high-quality sentence embeddings: a decoder-based LLM generates NLI data from few-shot prompts, and the generated data is used to fine-tune PromptEOL. Through a systematic exploration of few-shot strategies, notably a 5-shot×4 setup, the authors show that the diversity and quality of the generated hypotheses significantly improve STS performance, surpassing unsupervised baselines and approaching the performance obtained with manually annotated NLI data. The contributions are the identification of effective few-shot data generation strategies for NLI and state-of-the-art STS results in settings without large manually annotated datasets, which reduces the labeling burden. The approach is practically relevant for scalable, high-quality sentence embeddings in resource-constrained settings and lays groundwork for broader data-generation applications in NLP.
Abstract
Decoder-based large language models (LLMs) have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL requires a manually annotated natural language inference (NLI) dataset for fine-tuning. We aim to improve sentence embeddings without using large manually annotated datasets by automatically generating an NLI dataset with an LLM and using it to fine-tune PromptEOL. To this end, we explore data generation methods suitable for sentence embedding learning. Specifically, we focus on automatic dataset generation through few-shot learning and explore appropriate ways to leverage few-shot examples. Experimental results on the STS tasks demonstrate that our approach outperforms existing models in settings without large manually annotated datasets.
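The few-shot generation step described in the abstract can be illustrated with a minimal sketch of how a prompt might be assembled from a handful of premise-hypothesis examples. Everything here is an assumption made for illustration: the `NLIExample` type, the `build_few_shot_prompt` and `pick_shot_set` helpers, the instruction wording, and the example sentences are hypothetical and are not the authors' prompts. No LLM is called; the sketch only builds the text that would be sent to one. Rotating across several example sets is shown as one possible way to diversify generations, not as a confirmed description of the 5-shot×4 setup.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NLIExample:
    """A hand-written premise/hypothesis pair used as a few-shot example."""
    premise: str
    hypothesis: str


def build_few_shot_prompt(
    shots: Sequence[NLIExample],
    premise: str,
    label: str = "entailment",
) -> str:
    """Assemble a few-shot prompt that asks for a hypothesis for `premise`.

    The instruction text is a hypothetical wording, not taken from the paper.
    """
    instruction = (
        f"Write a hypothesis sentence whose relation to the premise is: {label}."
    )
    # One block per few-shot example, then an open slot for the new premise.
    blocks = [f"Premise: {ex.premise}\nHypothesis: {ex.hypothesis}" for ex in shots]
    blocks.append(f"Premise: {premise}\nHypothesis:")
    return instruction + "\n\n" + "\n\n".join(blocks)


def pick_shot_set(
    shot_sets: Sequence[Sequence[NLIExample]],
    index: int,
) -> Sequence[NLIExample]:
    """Cycle through several example sets so different premises see different shots."""
    return shot_sets[index % len(shot_sets)]


# Hypothetical example data, for illustration only.
shots = [
    NLIExample("A dog is running in a park.", "An animal is outdoors."),
    NLIExample("Two children are eating lunch.", "Some kids are having a meal."),
]
prompt = build_few_shot_prompt(shots, "A man is playing a guitar on stage.")
print(prompt)
```

The text completed by the LLM after the final `Hypothesis:` would serve as the generated hypothesis, and the resulting premise-hypothesis pairs would form the training data used for fine-tuning.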
