Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction

Michal Bravansky, Vaclav Kubon, Suhas Hariharan, Robert Kirk

TL;DR

The paper addresses interpretable dataset understanding with an unsupervised, reconstruction-driven pipeline in which large language models propose and evaluate binary features with minimal supervision. By formalizing features as binary predicates and selecting them by perplexity-based reconstruction, the approach yields compact, semantically meaningful representations that support both granular analysis and cross-dataset comparison. Empirically, it outperforms prompting baselines across multiple datasets, improves as more features are sampled, and adapts to practical tasks such as compressing jailbreak tactics and supporting compositional preference modeling. The work has implications for scalable, interpretable data analysis and safety-aligned AI; it also notes limitations and directions for future work, including numeric attributes and broader domain applications.

Abstract

Interpreting data is central to modern research. Large language models (LLMs) show promise in providing such natural language interpretations of data, yet simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over granularity and scale. To address these limitations, we propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted while maintaining compact and descriptive representations comparable to human labeling. Our method optimizes the selection of informative binary features by evaluating the ability of an LLM to reconstruct the original data using those features. We demonstrate its effectiveness in dataset modeling tasks and through two case studies: (1) constructing a feature representation of jailbreak tactics that compactly captures both the effectiveness and diversity of a larger set of human-crafted attacks; and (2) automating the discovery of features that align with human preferences, achieving accuracy and robustness comparable to human-crafted features. Moreover, we show that the pipeline scales effectively, improving as additional features are sampled, making it suitable for large and diverse datasets.
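The selection step described in the abstract (choosing binary features by how well they help reconstruct the data) can be illustrated with a small sketch. This is not the authors' implementation: `toy_logprobs` is a hypothetical stand-in for an LLM's token log-probabilities, `holds` is a hypothetical keyword-based feature predicate, and the greedy forward search with early stopping is an assumed selection strategy. Only the idea of scoring feature sets by reconstruction perplexity comes from the paper.

```python
import math
from typing import Callable, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def toy_logprobs(text: str, context: List[str]) -> List[float]:
    # Hypothetical stand-in for an LLM: a token that also appears in the
    # feature descriptions given as context is treated as easier to predict.
    context_words = {w for feat in context for w in feat.lower().split()}
    return [
        math.log(0.5) if tok in context_words else math.log(0.1)
        for tok in text.lower().split()
    ]


def mean_ppl(texts: List[str], selected: List[str],
             holds: Callable[[str, str], bool]) -> float:
    """Average perplexity when each text is conditioned on the selected
    features that hold for it."""
    total = 0.0
    for text in texts:
        context = [f for f in selected if holds(f, text)]
        total += perplexity(toy_logprobs(text, context))
    return total / len(texts)


def greedy_select(texts: List[str], candidates: List[str],
                  holds: Callable[[str, str], bool], k: int) -> List[str]:
    """Assumed strategy: add, one at a time, the feature that lowers mean
    perplexity the most; stop at k features or when nothing improves."""
    selected: List[str] = []
    for _ in range(k):
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        best = min(remaining, key=lambda f: mean_ppl(texts, selected + [f], holds))
        if mean_ppl(texts, selected + [best], holds) >= mean_ppl(texts, selected, holds):
            break
        selected.append(best)
    return selected


def holds(feature: str, text: str) -> bool:
    # Toy predicate: the feature's last word must occur in the text.
    return feature.split()[-1] in text.lower().split()


texts = ["cats chase cats", "dogs chase cats", "cats sleep", "dogs bark"]
candidates = ["mentions rain", "mentions dogs", "mentions cats"]
selected = greedy_select(texts, candidates, holds, k=3)
print(selected)
```

In this toy setup the feature that never applies ("mentions rain") is never selected, because adding it leaves the mean perplexity unchanged. In a real pipeline the scorer would be an actual language model, and the stopping rule would be set by the desired number of features.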

Paper Structure

This paper contains 66 sections, 7 figures, 14 tables, 1 algorithm.

Figures (7)

  • Figure 1: The proposed pipeline is able to extract semantically and structurally rich binary features from unsupervised data. Initially, an LLM analyzes each input text to generate candidate features. These candidates undergo clustering-based filtration to remove duplicates. The system then provides each feature as context to an LLM for the texts containing that feature and measures how well it enables reconstruction of the original data samples, using perplexity (PPL) as the measure of reconstruction quality. Features are iteratively concatenated to create a set that captures the dataset's properties.
  • Figure 2: Our pipeline outperforms LLM prompting and Zhong et al. (2024) in feature extraction across almost all metrics and datasets, showing higher class coverage (average correlation between classes and closest features), reconstruction accuracy (linear model accuracy on classes), and semantic preservation (number of semantically similar features as judged by an LLM), with the last reconstruction-based stage proving crucial for surpassing the baselines.
  • Figure 3: Our PMs demonstrate competitive performance with expert-crafted features in both generalization and robustness. Left: Generalization performance versus reference PM (smaller gap is better) shows comparable results across datasets. Right: Robustness analysis between PM A and PM B (smaller difference is better) reveals our model's superior performance on SHP and comparable results on HH-RLHF. All confidence intervals computed using 500 bootstrap iterations over prompts.
  • Figure 4: Impact of feature sampling size on the jailbreak performance metrics described in the jailbreak evaluation metrics section. While attack success rate (ASR) plateaus early, diversity metrics continue to improve until approximately 20 features, after which improvements become minimal across all metrics.
  • Figure 5: Features selected by our full pipeline demonstrate better robustness than clustering-only features, as shown by smaller differences between PMs trained on separate datasets. This indicates our method selects features that capture more generalizable preference patterns rather than overfitting to specific data subsets.
  • ...and 2 more figures
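
The class coverage metric named in the Figure 2 caption ("average correlation between classes and closest features") can be sketched as follows. This is one possible reading, not the paper's evaluation code: it assumes Pearson correlation between a one-hot class indicator and each binary feature column, takes the "closest" feature to be the one with the highest correlation, and averages over classes. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def class_coverage(labels: List[str], features: List[List[int]]) -> float:
    """Assumed definition: for each class, find the feature column most
    correlated with the class indicator, then average those maxima."""
    columns = list(zip(*features))  # one tuple of 0/1 values per feature
    best_per_class = []
    for cls in sorted(set(labels)):
        indicator = [1 if label == cls else 0 for label in labels]
        best_per_class.append(max(pearson(indicator, col) for col in columns))
    return sum(best_per_class) / len(best_per_class)


# Four samples, two classes, two binary features (rows are samples).
labels = ["a", "a", "b", "b"]
features = [[1, 0], [1, 0], [0, 1], [0, 0]]
coverage = class_coverage(labels, features)
print(round(coverage, 3))
```

Here the first feature matches class "a" exactly, while class "b" is only partly captured by the second feature, so the coverage falls below 1.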