KAXAI: An Integrated Environment for Knowledge Analysis and Explainable AI
Saikat Barua, Sifat Momen
TL;DR
KAXAI addresses the challenge of making machine learning more accessible by integrating AutoML, Explainable AI, and synthetic data generation into a single, user-friendly environment. The framework combines two novel classifiers, Logistic Regression Forest and Support Vector Tree, with a model-dependent interpreter called MEDLEY that provides local explanations, and it uses GANs and LLMs for dataset augmentation. The paper reports 96% accuracy on a diabetes dataset and 93% on a survey dataset, compares MEDLEY's interpretations against LIME, Greedy, and Parzen, and evaluates several data-generation strategies using performance and statistical analyses. The work aims to reduce integration friction, improve model transparency, and expand usable data through synthetic generation, offering practical benefits for practitioners and researchers working with tabular data.
Abstract
To fully harness the potential of machine learning, the field must be made more accessible and less daunting for people who lack a comprehensive understanding of its intricacies. This paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation behind a user-friendly interface. The system lets users navigate and apply machine learning while abstracting away its complexities. The paper proposes two novel classifiers, Logistic Regression Forest and Support Vector Tree, for enhanced model performance, achieving 96% accuracy on a diabetes dataset and 93% on a survey dataset. It also introduces a model-dependent local interpreter called MEDLEY and evaluates its interpretations against LIME, Greedy, and Parzen. Additionally, the paper presents three approaches to synthetic data: LLM-based generation, library-based generation, and augmenting the original dataset with a GAN. The findings suggest that augmenting the original dataset with a GAN is the most reliable of the three, as evidenced by KS tests, standard deviation, and feature importance. The authors also found that GANs work best for quantitative datasets.
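The abstract cites KS tests as one piece of evidence for judging synthetic data. As a rough illustration of what such a check measures, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) per column. It is not the authors' evaluation code: the function names `ks_statistic` and `compare_columns`, the column names, and all values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of real values <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of synthetic values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def compare_columns(real_table, synthetic_table):
    """Return {column: KS statistic} for columns present in both tables."""
    shared = [c for c in real_table if c in synthetic_table]
    return {c: ks_statistic(real_table[c], synthetic_table[c]) for c in shared}


# Illustrative values only (assumption), not data from the paper.
real = {"glucose": [85, 90, 110, 130], "bmi": [22.0, 27.5, 31.0, 35.2]}
synthetic = {"glucose": [88, 95, 112, 128], "bmi": [40.0, 41.0, 42.0, 43.0]}
scores = compare_columns(real, synthetic)
```

With these toy values, `scores["glucose"]` is 0.25 (the distributions overlap closely) and `scores["bmi"]` is 1.0 (no overlap at all). A statistic near 0 suggests the synthetic column follows the real distribution; a value near 1 suggests it does not. In practice a library routine that also returns a p-value would likely be preferable to this hand-rolled version.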
