Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups

Salar Hashemitaheri, Ian Harris

TL;DR

The paper tackles low engagement and stigma in online smoking-cessation support groups by developing a two-level data augmentation framework that improves intent detection for a conversational agent. Phase One creates synthetic posts with GPT-4, guided by strict quality screening and human-in-the-loop validation; Phase Two adds real posts from Ex-Community after cleaning and annotation. The augmented dataset yields a 32% relative improvement in F1 for intent detection. About 43% of the original posts were selected for augmentation and then expanded synthetically by 140%, and more than 10,000 real posts were scraped, 73% of which human annotators validated as good quality. The approach offers a replicable framework for enhancing conversational agents in data-scarce domains and can help reduce stigma by enabling timely, non-judgmental support in smoking-cessation contexts.

Abstract

Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. An automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response. We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open-source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 scores (F1 is the harmonic mean of precision and recall). Then, for these intents, we generated additional synthetic data using prompt engineering with the GPT model, with an average of 87% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43% of the original posts being selected for augmentation, followed by 140% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.
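
The first step described above, scoring each intent and flagging the weak ones for augmentation, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 0.6 threshold, and the intent labels ("craving", "relapse", "greeting") are hypothetical and chosen only for illustration. It computes one-vs-rest F1 per intent from gold and predicted labels and returns the intents that fall below the threshold.

```python
from collections import Counter


def per_intent_f1(true_labels, predicted_labels):
    """Return {intent: F1}, computed one-vs-rest for every intent seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted intent gains a false positive
            fn[truth] += 1  # gold intent gains a false negative

    scores = {}
    for intent in set(true_labels) | set(predicted_labels):
        predicted_count = tp[intent] + fp[intent]
        gold_count = tp[intent] + fn[intent]
        precision = tp[intent] / predicted_count if predicted_count else 0.0
        recall = tp[intent] / gold_count if gold_count else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores[intent] = 2 * precision * recall / total if total else 0.0
    return scores


def low_f1_intents(scores, threshold=0.6):
    """Intents whose F1 is below the (hypothetical) threshold, sorted by name."""
    return sorted(intent for intent, f1 in scores.items() if f1 < threshold)


# Toy labels invented for this example; they are not data from the paper.
truth = ["craving", "craving", "relapse", "relapse", "greeting", "greeting"]
preds = ["craving", "relapse", "relapse", "craving", "greeting", "greeting"]

scores = per_intent_f1(truth, preds)
weak_intents = low_f1_intents(scores, threshold=0.6)
print(weak_intents)
```

In a pipeline like the one the abstract outlines, the intents returned by such a filter would be the ones targeted for synthetic generation; how the authors actually chose their cutoff is not stated in this summary.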

Paper Structure

This paper contains 13 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Conversational Agent
  • Figure 2: Overview of the proposed method
  • Figure 3: Overview of Synthetic Augmentation
  • Figure 4: Overview of Real Augmentation
  • Figure 5: Comparing the F1 Score of Classification for Low-accuracy Intents
  • ...and 2 more figures