DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI
Hyowon Cho, Soonwon Ka, Daechul Park, Jaewook Kang, Minjoon Seo, Bokyung Son
TL;DR
Data Scientist AI (DSAI) tackles data grounding and bias in latent feature extraction by introducing a data-centric, five-stage pipeline that guides LLMs to derive interpretable features with a quantifiable prominence metric. The approach reduces reliance on pre-trained knowledge by keeping the model task-agnostic during feature generation and by grounding outputs in concrete data. Its five stages are perspective generation, value matching, clustering, verbalization, and prominence-based selection. Validation on expert-annotated synthetic datasets demonstrates high recall of expert criteria and robust discriminative power, while reliability checks show strong internal consistency across stages. Real-world applications on the MIND, SPAM, and Reddit datasets illustrate the method's adaptability, interpretability, and potential for downstream tasks such as classification and annotation guideline formation. Limitations related to LLM quality, data modality, and computational cost remain.
Abstract
Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets due to their reliance on pre-trained knowledge rather than actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications on real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. The title of our paper is chosen from multiple candidates based on DSAI-generated criteria.
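To make the shape of the five stages named above more concrete, the following is a minimal toy sketch, not the authors' implementation. Every name in it (`PERSPECTIVES`, `match_values`, `cluster_and_verbalize`, `prominence`, `select_features`) is hypothetical. The LLM calls are replaced by simple keyword rules so the code runs on its own, and the prominence score is an assumed stand-in (the absolute difference in how often a feature appears in two labeled groups), since this section does not define the metric. Labels are used only in the final selection step, mirroring the task-agnostic feature generation described above.

```python
from collections import defaultdict

# Stage 1 stand-in (perspective generation): in DSAI an LLM proposes
# perspectives; these two keyword rules are hypothetical placeholders.
PERSPECTIVES = {
    "tone": lambda t: "urgent" if "!" in t else "neutral",
    "length": lambda t: "short" if len(t.split()) <= 4 else "long",
}

def match_values(texts):
    """Stage 2 stand-in (value matching): one value per perspective per text."""
    return [{p: rule(t) for p, rule in PERSPECTIVES.items()} for t in texts]

def cluster_and_verbalize(rows):
    """Stages 3-4 stand-in: group identical values and name each group."""
    clusters = defaultdict(list)
    for idx, row in enumerate(rows):
        for perspective, value in row.items():
            clusters[f"{perspective} is {value}"].append(idx)
    return dict(clusters)

def prominence(members, labels):
    """ASSUMED metric, not the paper's definition: absolute difference in the
    share of positive vs. negative items that carry the feature."""
    member_set = set(members)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    pos_share = sum(i in member_set for i in pos) / len(pos)
    neg_share = sum(i in member_set for i in neg) / len(neg)
    return abs(pos_share - neg_share)

def select_features(texts, labels, threshold=0.75):
    """Stage 5 stand-in: keep features whose score meets the threshold."""
    clusters = cluster_and_verbalize(match_values(texts))
    scored = {name: prominence(m, labels) for name, m in clusters.items()}
    return {name: s for name, s in scored.items() if s >= threshold}

# Toy data: 1 = spam-like, 0 = ordinary message (illustrative only).
texts = [
    "Win cash now!",
    "Meeting moved to three pm tomorrow",
    "Free prize!",
    "See you at lunch",
]
labels = [1, 0, 1, 0]
selected = select_features(texts, labels)
print(selected)
```

On this toy input the tone features separate the two groups completely, while the length features do so only partially and fall below the threshold. The point is only the flow of data between stages: features are generated and matched without looking at labels, and a numeric score decides which verbalized features are kept.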
