iGAiVA: Integrated Generative AI and Visual Analytics in a Machine Learning Workflow for Text Classification
Yuanzhe Jin, Adrian Carrasco-Revilla, Min Chen
TL;DR
The paper addresses data distribution challenges in text classification by introducing a VIS4ML workflow that uses visual analytics to guide large-language-model–based synthetic data generation. It presents iGAiVA, a four-view tool that integrates data synthesis, data management, model training, and results analysis in an iterative loop. Using VA techniques (t-SNE, PCA, RBF, tag-treemap), the approach locates regions of data deficiency and directs synthesis at them, yielding targeted recall improvements (e.g., T13 from 18% to 31%) alongside modest overall gains and positive industry feedback. The work demonstrates a practical, human-in-the-loop pathway for data-efficient model improvement in real-world text analytics and points to future industrial deployment and broader generalization.
Abstract
In developing machine learning (ML) models for text classification, one common challenge is that the collected data is often not ideally distributed, especially when new classes are introduced in response to changes in data and tasks. In this paper, we present a solution that uses visual analytics (VA) to guide the generation of synthetic data with large language models. Because VA enables model developers to identify data-related deficiencies, data synthesis can be targeted to address them. We discuss different types of data deficiency, describe VA techniques that support their identification, and demonstrate the effectiveness of targeted data synthesis in improving model accuracy. In addition, we present a software tool, iGAiVA, which maps four groups of ML tasks into four VA views, integrating generative AI and VA into an ML workflow for developing and improving text classification models.
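To make the idea of targeting data synthesis at identified deficiencies concrete, the following is a minimal sketch and not the iGAiVA implementation. It assumes that a deficiency can be approximated by low per-class recall or low per-class support on an evaluation set; the function name `find_deficient_classes` and the thresholds `min_recall` and `min_support` are hypothetical choices made for illustration only. In the workflow described here, this identification step is carried out by a model developer through the VA views rather than by fixed thresholds.

```python
from collections import Counter


def find_deficient_classes(y_true, y_pred, min_recall=0.5, min_support=20):
    """Flag classes that may benefit from targeted data synthesis.

    Hypothetical helper: a class is flagged when its recall on the
    evaluation labels is below `min_recall` or when it has fewer than
    `min_support` evaluation examples. Both thresholds are assumptions.
    """
    support = Counter(y_true)  # number of true examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class

    report = {}
    for label, n in support.items():
        recall = hits[label] / n
        if recall < min_recall or n < min_support:
            report[label] = {"support": n, "recall": recall}
    return report


if __name__ == "__main__":
    # Toy labels, invented for illustration; not data from the paper.
    y_true = ["T1"] * 4 + ["T13"] * 4
    y_pred = ["T1", "T1", "T1", "T13", "T13", "T1", "T1", "T1"]
    print(find_deficient_classes(y_true, y_pred, min_recall=0.5, min_support=1))
```

The flagged classes would then serve as candidates for LLM-based synthesis, after which the model is retrained and the same check is repeated in the next iteration of the loop.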
