DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI

Hyowon Cho, Soonwon Ka, Daechul Park, Jaewook Kang, Minjoon Seo, Bokyung Son

TL;DR

Data Scientist AI (DSAI) tackles data-grounding and bias in latent feature extraction by introducing a data-centric, five-stage pipeline that guides LLMs to derive interpretable features with a quantifiable prominence metric. The approach mitigates reliance on pre-trained knowledge by keeping the model task-agnostic during feature generation and grounding outputs in concrete data through perspective generation, value matching, clustering, verbalization, and prominence-based selection. Validation on expert-annotated synthetic datasets demonstrates high recall of expert criteria and robust discriminative power, while reliability checks show strong internal consistency across stages. Real-world applications across MIND, SPAM, and Reddit datasets illustrate the method's adaptability, interpretability, and potential for downstream tasks such as classification and annotation guideline formation, though limitations related to LLM quality, data modality, and computational cost remain.
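
As a rough illustration of how the five stages named above could fit together, the following toy sketch wires them into one function. Everything in it is an assumption made for illustration: the function names, the hard-coded rules standing in for LLM calls, the two-group setup, and in particular the prominence score, which is approximated here as a simple difference in coverage between two groups and is not the metric defined in the paper.

```python
from collections import Counter

# All names below are hypothetical stand-ins; in DSAI the stages are LLM-driven.

def generate_perspectives(samples):
    # Stage 1 (perspective generation): hard-coded here instead of LLM-proposed.
    return ["tone", "length"]

def match_value(text, perspective):
    # Stage 2 (value matching): assign one value per data point and perspective.
    if perspective == "tone":
        return "urgent" if "!" in text else "neutral"
    return "short" if len(text.split()) <= 4 else "long"

def cluster_values(values):
    # Stage 3 (clustering): would merge redundant values; identity map in this toy.
    return {v: v for v in set(values)}

def verbalize(perspective, value):
    # Stage 4 (verbalization): turn a clustered value into a readable criterion.
    return f"{perspective} is {value}"

def prominence(pos_hits, pos_total, neg_hits, neg_total):
    # Stage 5 score: ASSUMED proxy (coverage difference), not the paper's formula.
    return pos_hits / pos_total - neg_hits / neg_total

def run_pipeline(positives, negatives, threshold=0.5):
    selected = {}
    for perspective in generate_perspectives(positives + negatives):
        pos_values = [match_value(t, perspective) for t in positives]
        neg_values = [match_value(t, perspective) for t in negatives]
        clusters = cluster_values(pos_values + neg_values)
        pos_counts = Counter(clusters[v] for v in pos_values)
        neg_counts = Counter(clusters[v] for v in neg_values)
        for value in sorted(set(clusters.values())):
            score = prominence(pos_counts[value], len(positives),
                               neg_counts[value], len(negatives))
            # Prominence-based selection: keep criteria above the threshold.
            if score >= threshold:
                selected[verbalize(perspective, value)] = score
    return selected

# Made-up example data, not taken from the SPAM dataset.
spam = ["Win cash now!", "Claim your prize!", "Free offer!"]
ham = ["See you at the meeting tomorrow afternoon",
       "Please review the attached quarterly report",
       "Lunch?"]
selected = run_pipeline(spam, ham)
```

In this sketch no stage is given the downstream task; the group labels are used only when scoring, which loosely mirrors the task-agnostic feature generation described above.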

Abstract

Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets due to their reliance on pre-trained knowledge rather than actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications on real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. The title of our paper is chosen from multiple candidates based on DSAI-generated criteria.

Paper Structure

This paper contains 71 sections, 8 equations, 19 figures, and 14 tables.

Figures (19)

  • Figure 1: DP scores for direct feature generation and DSAI methods.
  • Figure 2: Overview of the DSAI pipeline: Perspectives are first generated to guide analysis (#1), then used to match values to data points (#2). These values are clustered to reduce redundancy (#3), verbalized into concise criteria (#4), and prioritized based on their prominence (#5).
  • Figure 3: Example of interpretable spam classification: The figure shows how feature prominence guides criteria selection, with high-prominence criteria improving spam classification performance.
  • Figure 4: Comparison of Prominence scores and data coverage across frequency and prominence buckets.
  • Figure 5: Dropped criterion as Prominence threshold increases.
  • ...and 14 more figures