Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples

Soma Sato, Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda

TL;DR

High-quality sentence embeddings currently depend on large manually annotated NLI datasets. This work reduces that dependence by automatically generating NLI datasets with a decoder-based LLM and few-shot prompts, then using the generated data to fine-tune PromptEOL. The authors systematically explore few-shot strategies, in particular a 5-shot×4 setup, and show that more diverse, higher-quality generated hypotheses improve STS performance, surpassing unsupervised baselines and approaching the performance obtained with manually annotated NLI data. The contributions are the identification of effective few-shot data-generation strategies for NLI and state-of-the-art STS results in settings without large manually annotated datasets, which reduces the labeling burden. The approach is practical for building high-quality sentence embeddings at scale in resource-constrained settings and lays groundwork for broader data-generation applications in NLP.

Abstract

Decoder-based large language models (LLMs) have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL requires a manually annotated natural language inference (NLI) dataset for fine-tuning. We aim to improve sentence embeddings without using large manually annotated datasets by automatically generating an NLI dataset with an LLM and using it to fine-tune PromptEOL. To achieve this, we explore data generation methods suitable for sentence embedding learning. Specifically, we focus on automatic dataset generation through few-shot learning and explore appropriate ways to leverage few-shot examples. Experimental results on the STS tasks demonstrate that our approach outperforms existing models in settings without large manually annotated datasets.
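
To make the few-shot generation idea concrete, the following is a minimal sketch of how few-shot prompts for hypothesis generation might be assembled. It is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `split_into_shot_sets`, `prompts_for`), and the reading of "5-shot×4" as four disjoint sets of five examples are all assumptions made for illustration. The call to the LLM itself is left out.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str]  # (premise, hypothesis) pair; hypothetical data layout


def build_prompt(shots: List[Example], premise: str, label: str) -> str:
    """Format the few-shot examples, then the new premise with an open hypothesis slot."""
    # The instruction wording is an assumption, not the prompt used in the paper.
    lines = [f"Write a hypothesis whose relation to the premise is: {label}.", ""]
    for shot_premise, shot_hypothesis in shots:
        lines += [f"Premise: {shot_premise}", f"Hypothesis: {shot_hypothesis}", ""]
    lines += [f"Premise: {premise}", "Hypothesis:"]
    return "\n".join(lines)


def split_into_shot_sets(
    pool: List[Example], shots_per_set: int, num_sets: int, seed: int = 0
) -> List[List[Example]]:
    """Shuffle the example pool and cut it into disjoint few-shot sets."""
    needed = shots_per_set * num_sets
    if len(pool) < needed:
        raise ValueError(f"need {needed} examples, got {len(pool)}")
    shuffled = list(pool)
    random.Random(seed).shuffle(shuffled)
    return [
        shuffled[i * shots_per_set:(i + 1) * shots_per_set]
        for i in range(num_sets)
    ]


def prompts_for(
    premises: List[str], shot_sets: List[List[Example]], label: str
) -> List[str]:
    """Rotate through the shot sets so that different premises see different examples."""
    return [
        build_prompt(shot_sets[i % len(shot_sets)], premise, label)
        for i, premise in enumerate(premises)
    ]


if __name__ == "__main__":
    # Placeholder examples; a real pool would hold a few manually written NLI pairs.
    pool = [(f"premise {i}", f"hypothesis {i}") for i in range(20)]
    shot_sets = split_into_shot_sets(pool, shots_per_set=5, num_sets=4)
    for prompt in prompts_for(["A man is playing a guitar."], shot_sets, "entailment"):
        print(prompt)
```

Rotating across several small example sets is one simple way to vary the context the LLM sees, which is in the spirit of the diversification the summary highlights. The completions returned for each prompt would then serve as generated hypotheses for fine-tuning.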

Paper Structure

This paper contains 22 sections, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: Performances of different few-shot settings
  • Figure 2: Performances of models fine-tuned with the automatically generated datasets and existing methods
  • Figure 3: Frequency distribution of token counts in the manual NLI dataset