An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text

Paula Joy B. Martinez, Jose Marie Antonio Miñoza, Sebastian C. Ibañez

TL;DR

The paper tackles data scarcity in social-media emotion recognition, caused by API costs and platform restrictions, by introducing an interpretability-guided synthetic-data framework that uses SHAP-derived feature importance to steer LLM-generated samples. It combines exemplars with TF-IDF differential scoring to produce emotion-aligned text and evaluates three augmentation strategies on TweetEval, showing that SHAP-guided data can match real-data expansion when the seed set is sufficiently large, while naïve generation degrades performance. Key findings include robust minority-class gains and context-dependent effectiveness, with a notable trade-off: synthetic data is lexically less diverse and less personal than authentic posts, which risks weaker generalization as linguistic patterns evolve. The work offers a practical, interpretable approach to responsible synthetic data, emphasizing seed-data thresholds and the need for real data to preserve linguistic authenticity. Future directions include transformer-based models, cross-platform transferability, and automated quality metrics.

Abstract

Emotion recognition from social media is critical for understanding public sentiment, but accessing training data has become prohibitively expensive due to escalating API costs and platform restrictions. We introduce an interpretability-guided framework where Shapley Additive Explanations (SHAP) provide principled guidance for LLM-based synthetic data generation. With sufficient seed data, the SHAP-guided approach matches real data performance, significantly outperforms naïve generation, and substantially improves classification for underrepresented emotion classes. However, our linguistic analysis reveals that synthetic text exhibits reduced vocabulary richness and fewer personal or temporally complex expressions than authentic posts. This work provides both a practical framework for responsible synthetic data generation and a critical perspective on its limitations, underscoring that the future of trustworthy AI depends on navigating the trade-offs between synthetic utility and real-world authenticity.
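
The TL;DR mentions TF-IDF differential scoring without giving its definition. As a rough illustration only, the sketch below assumes one plausible reading: a term's score for an emotion is its mean TF-IDF weight in posts of that emotion minus its mean weight in all other posts. The function names, the whitespace tokenizer, and the smoothed IDF formula are assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times a smoothed IDF (an assumed variant).
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def differential_scores(texts, labels, target):
    """Mean TF-IDF inside `target` minus mean TF-IDF in every other class."""
    docs = [text.lower().split() for text in texts]  # naive tokenizer (assumption)
    vectors = tfidf_vectors(docs)
    inside = [v for v, y in zip(vectors, labels) if y == target]
    outside = [v for v, y in zip(vectors, labels) if y != target]
    vocab = {term for v in vectors for term in v}

    def mean_weight(group, term):
        return sum(v.get(term, 0.0) for v in group) / len(group) if group else 0.0

    return {t: mean_weight(inside, t) - mean_weight(outside, t) for t in vocab}


def top_keywords(texts, labels, target, k=5):
    """Return the k terms most distinctive for `target`, ties broken alphabetically."""
    scores = differential_scores(texts, labels, target)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data invented for illustration; not taken from TweetEval.
texts = [
    "so angry furious today",
    "furious and angry again",
    "hopeful bright future",
    "bright hopeful morning",
]
labels = ["anger", "anger", "optimism", "optimism"]
anger_keywords = top_keywords(texts, labels, "anger", k=2)
```

Terms that occur across several posts of one emotion and rarely elsewhere get the highest scores, which is the property a keyword-steered generation prompt would rely on.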

Paper Structure

This paper contains 30 sections, 3 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: The experimental setup comparing three augmentation strategies: real data expansion, SHAP-guided generation (exemplars + SHAP keywords), and naïve generation (exemplars only). SHAP analysis uses the baseline model to extract emotion-specific keywords. All strategies are evaluated with identical model parameters, data splits, and incremental testing.
  • Figure 2: SHAP-guided synthetic data achieves equivalent performance to real data expansion, while naïve synthetic data degrades consistently, highlighting the importance of principled generation approaches.
  • Figure 3: SHAP-guided augmentation excels for minority class optimism (8.8%), demonstrating how interpretability-guided generation can address class imbalance.
  • Figure 4: All strategies show stable performance for majority class anger (42%), demonstrating that adequate baseline data reduces augmentation sensitivity.
  • Figure 5: With 500-sample baseline, SHAP-guided generation achieves parity with real data initially but plateaus at higher increments, demonstrating that interpretability-guided generation requires adequate seed data for sustained effectiveness.
  • ...and 7 more figures
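
The Figure 1 caption distinguishes SHAP-guided generation (exemplars + SHAP keywords) from naïve generation (exemplars only). A minimal sketch of how that difference might appear at the prompt level is given below. The function `build_prompt`, the prompt wording, and the sample inputs are all hypothetical, since the paper's actual prompts are not shown in this section.

```python
def build_prompt(emotion, exemplars, keywords=None):
    """Assemble a generation prompt; `keywords` is set only for the guided strategy.

    Hypothetical template for illustration, not the prompt used in the paper.
    """
    lines = [
        f"Write one new social media post expressing {emotion}.",
        "Examples:",
    ]
    lines += [f"- {example}" for example in exemplars]
    if keywords:
        # The guided variant adds emotion-specific keywords to steer the LLM.
        lines.append(
            "Try to use some of these words naturally: " + ", ".join(keywords)
        )
    return "\n".join(lines)


# Invented exemplars and keywords, for illustration only.
exemplars = ["Things are finally looking up!", "Tomorrow will be better."]
naive_prompt = build_prompt("optimism", exemplars)
guided_prompt = build_prompt("optimism", exemplars, keywords=["hope", "better"])
```

Holding the exemplars fixed and changing only the keyword line keeps the two strategies comparable, so any performance gap can be attributed to the keyword guidance.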