Generating Samples to Probe Trained Models
Eren Mehmet Kıral, Nurşen Aydın, Ş. İlker Birbil
TL;DR
This work proposes a probabilistic framework for interrogating trained models by generating data samples that answer specified probing questions. The authors formulate a data-space objective G, pair it with a parameter-space objective F, and use Gibbs-based sampling, implemented with the Metropolis-adjusted Langevin algorithm (MALA) and optionally run in the latent space of a variational autoencoder (VAE), to produce samples that reveal prediction-risky regions, parameter sensitivity, and contrasts between models. The approach is demonstrated on tabular and image data. The experiments expose disagreements between models, uncertainty near decision boundaries, and counterfactuals that stay on the data manifold, and they include comparisons with existing counterfactual methods such as DiCE. The framework offers a flexible, interpretable way to analyze model behavior beyond traditional accuracy metrics, with practical relevance for fairness, robustness, and explainability. The authors also release code and discuss future directions, including domain-specific constraints and richer priors.
Abstract
There is a growing need to investigate how machine learning models operate. In this work, we aim to understand trained machine learning models by questioning their data preferences. We propose a mathematical framework that allows us to probe trained models and identify their preferred samples in various scenarios, including prediction-risky, parameter-sensitive, and model-contrastive samples. To showcase our framework, we pose these queries to a variety of models trained on a range of classification and regression tasks, and receive answers in the form of generated data.
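To make the idea of a prediction-risky query concrete, the following is a minimal sketch of MALA sampling, not the authors' implementation. It assumes the target density has the Gibbs form exp(-G(x)/T), and it uses a hypothetical toy setup: a fixed two-dimensional logistic model with weights `w` and bias `b`, an objective `G` that is small where the logit is near zero (that is, near the decision boundary), a quadratic penalty `lam` that keeps the density normalizable, and a temperature `T`. All of these names and values are illustrative assumptions.

```python
import numpy as np


def mala_sample(U, grad_U, x0, step=0.004, n_steps=5000, rng=None):
    """Draw samples from a density proportional to exp(-U(x)) with MALA."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    accepted = 0

    def log_q(x_to, x_from):
        # Log density (up to a constant) of the Langevin proposal x_from -> x_to.
        diff = x_to - x_from + step * grad_U(x_from)
        return -np.dot(diff, diff) / (4.0 * step)

    for i in range(n_steps):
        # Langevin proposal: gradient step on U plus Gaussian noise.
        noise = rng.standard_normal(x.size)
        prop = x - step * grad_U(x) + np.sqrt(2.0 * step) * noise
        # Metropolis-Hastings correction for the asymmetric proposal.
        log_alpha = U(x) - U(prop) + log_q(x, prop) - log_q(prop, x)
        if np.log(rng.uniform()) < log_alpha:
            x = prop
            accepted += 1
        samples[i] = x
    return samples, accepted / n_steps


# Hypothetical "trained" logistic model (assumed values, for illustration only).
w = np.array([1.5, -2.0])
b = 0.5
T = 0.1    # assumed temperature of the Gibbs density
lam = 0.1  # assumed quadratic penalty keeping samples in a bounded region


def G(x):
    # Small when the logit is near zero, i.e. the predicted probability is near 0.5.
    return (w @ x + b) ** 2 + 0.5 * lam * (x @ x)


def grad_G(x):
    return 2.0 * (w @ x + b) * w + lam * x


samples, acc_rate = mala_sample(
    U=lambda x: G(x) / T,
    grad_U=lambda x: grad_G(x) / T,
    x0=np.zeros(2),
    rng=np.random.default_rng(0),
)
logits = samples[1000:] @ w + b  # discard the first 1000 draws as burn-in
probs = 1.0 / (1.0 + np.exp(-logits))
```

Under these assumptions, the retained samples should concentrate where the toy model's predicted probability is close to 0.5. Replacing `G` with a different objective changes the question being posed while the sampler stays the same.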
