Investigating Cost-Efficiency of LLM-Generated Training Data for Conversational Semantic Frame Analysis

Shiho Matta; Yin Jou Huang; Fei Cheng; Hirokazu Kiyomaru; Yugo Murawaki

TL;DR

Experiments reveal that optimal cost-efficiency is achieved by combining human and LLM-generated data across a wide range of budget levels, and that as the budget decreases, a higher proportion of LLM-generated data becomes preferable.

Abstract

Recent studies have demonstrated that few-shot learning allows LLMs to generate training data for supervised models at a low cost. However, the quality of LLM-generated data may not entirely match that of human-labeled data. This raises a crucial question: how should one balance the trade-off between higher-quality but more expensive human data and lower-quality yet substantially cheaper LLM-generated data? In this paper, we synthesize training data for conversational semantic frame analysis using GPT-4 and examine how to allocate budgets optimally to achieve the best performance. Our experiments, conducted across various budget levels, reveal that optimal cost-efficiency is achieved by combining human and LLM-generated data across a wide range of budget levels. Notably, as the budget decreases, a higher proportion of LLM-generated data becomes preferable.
