An Interpretability-Guided Framework for Responsible Synthetic Data Generation in Emotional Text
Paula Joy B. Martinez, Jose Marie Antonio Miñoza, Sebastian C. Ibañez
TL;DR
The paper tackles data scarcity in social-media emotion recognition, which is caused by rising API costs and platform restrictions, by introducing an interpretability-guided synthetic-data framework that uses SHAP-derived feature importance to steer LLM-generated samples. It combines exemplars with TF-IDF differential scoring to produce emotion-aligned text and evaluates three augmentation strategies on TweetEval. The evaluation shows that SHAP-guided data can match real-data expansion when the seed set is sufficiently large, whereas naïve generation degrades performance. Key findings include robust minority-class gains and context-dependent effectiveness, along with a notable trade-off: synthetic data is lexically less diverse and less personal, which risks weaker generalization to evolving linguistic patterns. The work offers a practical, interpretable approach to responsible synthetic data generation, emphasizing seed-data thresholds and the need for real data to preserve linguistic authenticity. Future directions include transformer-based models, cross-platform transferability, and automated quality metrics.
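To make the idea of TF-IDF differential scoring concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not give the exact formula, so the sketch assumes the score of a term for an emotion is its mean TF-IDF weight inside that class minus its mean weight outside it. The function names (`tfidf_vectors`, `differential_scores`, `top_terms`), the whitespace tokenizer, and the smoothed IDF are all hypothetical choices, and the SHAP step is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Relative term frequency times a smoothed IDF (an assumed variant).
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def differential_scores(docs, labels, target):
    """Assumed score: mean TF-IDF inside the target class minus mean outside it."""
    vectors = tfidf_vectors(docs)
    inside = [v for v, y in zip(vectors, labels) if y == target]
    outside = [v for v, y in zip(vectors, labels) if y != target]
    if not inside or not outside:
        raise ValueError("need documents both inside and outside the target class")
    vocab = set().union(*vectors)

    def mean_weight(group, term):
        return sum(v.get(term, 0.0) for v in group) / len(group)

    return {t: mean_weight(inside, t) - mean_weight(outside, t) for t in vocab}


def top_terms(docs, labels, target, k=5):
    """Return the k terms most distinctive of the target class (ties broken alphabetically)."""
    scores = differential_scores(docs, labels, target)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy seed set (invented for illustration, not taken from TweetEval).
seed_docs = [
    "so angry furious today",
    "furious at the delay",
    "happy sunny day",
    "happy and glad today",
]
seed_labels = ["anger", "anger", "joy", "joy"]
anger_terms = top_terms(seed_docs, seed_labels, "anger", k=3)
```

In a pipeline like the one summarized above, terms ranked this way could be placed in the generation prompt alongside exemplar posts so that the LLM is nudged toward class-distinctive vocabulary. Terms shared evenly across classes, such as "today" in the toy data, score near zero and drop out of the ranking.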
Abstract
Emotion recognition from social media is critical for understanding public sentiment, but accessing training data has become prohibitively expensive due to escalating API costs and platform restrictions. We introduce an interpretability-guided framework in which Shapley Additive Explanations (SHAP) provide principled guidance for LLM-based synthetic data generation. With sufficient seed data, the SHAP-guided approach matches real-data performance, significantly outperforms naïve generation, and substantially improves classification for underrepresented emotion classes. However, our linguistic analysis reveals that synthetic text exhibits reduced vocabulary richness and fewer personal or temporally complex expressions than authentic posts. This work provides both a practical framework for responsible synthetic data generation and a critical perspective on its limitations, underscoring that the future of trustworthy AI depends on navigating the trade-offs between synthetic utility and real-world authenticity.
