iGAiVA: Integrated Generative AI and Visual Analytics in a Machine Learning Workflow for Text Classification

Yuanzhe Jin, Adrian Carrasco-Revilla, Min Chen

TL;DR

The paper addresses data distribution challenges in text classification by introducing a VIS4ML workflow that uses visual analytics to guide large-language-model–based synthetic data generation. It presents iGAiVA, a four-view tool that integrates data synthesis, data management, model training, and results analysis in an iterative loop. Through VA techniques (t-SNE, PCA, RBF, tag-treemap), the approach targets data-deficiency regions, achieving targeted recall improvements (e.g., T13 from 18% to 31%) with modest overall gains and positive industry feedback. The work demonstrates a practical, human-in-the-loop pathway for data-efficient model improvement in real-world text analytics and points to future industrial deployment and broader generalization.

Abstract

In developing machine learning (ML) models for text classification, one common challenge is that the collected data is often not ideally distributed, especially when new classes are introduced in response to changes in data and tasks. In this paper, we present a solution for using visual analytics (VA) to guide the generation of synthetic data using large language models. As VA enables model developers to identify data-related deficiencies, data synthesis can be targeted to address them. We discuss different types of data deficiency, describe different VA techniques for supporting their identification, and demonstrate the effectiveness of targeted data synthesis in improving model accuracy. In addition, we present a software tool, iGAiVA, which maps four groups of ML tasks into four VA views, integrating generative AI and VA into an ML workflow for developing and improving text classification models.

Paper Structure (23 sections, 2 equations, 25 figures, 8 tables)

Figures (25)

  • Figure 1: The evolution from a conventional ML workflow to an experimental workflow involving the use of VIS techniques and LLMs for data synthesis, and then to an iterative workflow supported by a VIS4ML tool in which VIS and LLM techniques are integrated.
  • Figure 2: Source data size vs. Recall
  • Figure 3: Visual patterns in a t-SNE scatter plot can offer some hints about the data-related causes of accurate or erroneous classification.
  • Figure 4: Two examples of detailed visual analysis for investigating class T12. The two PCA scatter plots on the left show that Dimension 0 in (a) and Dimension 13 in (b) can separate the data objects into two regions, and data objects in one region have higher recall, while the overall class recall is only 37.5%. Each RBF plot in the second column makes the boundary between the high-recall and low-recall regions clearer, enabling the selection of a division line to study the summary statistics of the messages in the two regions using a tag-treemap on the right.
  • Figure 5: The class T13 has the lowest recall among all classes. The scatter plot in (a) indicates more classification errors (red dots) when the data objects are associated with lower values in PCA feature dimension 0. The RBF heatmap in (b) confirms this pattern and enables data synthesis to be targeted at an erroneous cluster on the left as shown in (c). The model retrained with additional LLM-synthesized data is improved in (d). The RBF heatmap for the new testing results in (e) and the zoomed-in scatter plots confirm the improvement.
  • ...and 20 more figures
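
The captions of Figures 4 and 5 describe selecting a division line along one PCA dimension and comparing recall in the two resulting regions, with an RBF plot making the boundary clearer. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the samples of a single class have already been projected onto one PCA dimension, and it assumes a Gaussian kernel for the RBF smoothing; the function names, the `bandwidth` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
import math


def recall_by_region(coords, correct, threshold):
    """Split one class's test samples at `threshold` and return recall on each side.

    coords  -- position of each sample along the chosen PCA dimension
    correct -- 1 if the sample was classified correctly, else 0
    """
    left = [c for x, c in zip(coords, correct) if x < threshold]
    right = [c for x, c in zip(coords, correct) if x >= threshold]

    def rate(flags):
        # Recall within a region: fraction of the class's samples predicted correctly.
        return sum(flags) / len(flags) if flags else float("nan")

    return rate(left), rate(right)


def rbf_smoothed_recall(coords, correct, grid, bandwidth=1.0):
    """Gaussian-kernel weighted recall estimate at each grid position (assumed kernel)."""
    estimates = []
    for g in grid:
        weights = [math.exp(-((x - g) ** 2) / (2 * bandwidth ** 2)) for x in coords]
        total = sum(weights)
        if total > 0:
            estimates.append(sum(w * c for w, c in zip(weights, correct)) / total)
        else:
            estimates.append(float("nan"))
    return estimates


# Illustrative toy data only; these are not results from the paper.
coords = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
correct = [0, 0, 1, 0, 1, 1, 0, 1]

left_recall, right_recall = recall_by_region(coords, correct, threshold=0.0)
smoothed = rbf_smoothed_recall(coords, correct, grid=[-1.5, 1.5], bandwidth=0.5)
print(left_recall, right_recall, smoothed)
```

In this sketch, a region whose recall is clearly lower than the other side of the division line would be a candidate target for LLM-based data synthesis, in the spirit of the erroneous cluster described in the Figure 5 caption.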