Model-Free Counterfactual Subset Selection at Scale
Minh Hieu Nguyen, Viet Hung Doan, Anh Tuan Nguyen, Jun Jo, Quoc Viet Hung Nguyen
TL;DR
The paper tackles the challenge of providing actionable, instance-level counterfactual explanations without relying on predictive models or offline data. It introduces a model-free, one-pass streaming algorithm for diversified counterfactual subset selection, achieving $O(\log k)$ update time per incoming item and $O(k)$ space by leveraging extensibility matroids and a combination of diversity constructs. The approach defines multiple utility functions that balance query similarity with diversity, and enforces both per-label and global cardinality constraints to ensure plausible, varied explanations. Empirical results on real and synthetic datasets show reduced user effort (up to a $33\%$ reduction in transport cost) and robust performance under adversarial shifts, with runtime suitable for real-time deployment. These results demonstrate the method’s potential for scalable, bias-conscious explanations across domains such as finance, hiring, and healthcare, while maintaining theoretical guarantees of solution quality.
Abstract
Ensuring transparency in AI decision-making requires interpretable explanations, particularly at the instance level. Counterfactual explanations are a powerful tool for this purpose, but existing techniques frequently depend on synthetic examples, introducing biases from unrealistic assumptions, flawed models, or skewed data. Many methods also assume full dataset availability, an impractical constraint in real-time environments where data flows continuously. In contrast, streaming explanations offer adaptive, real-time insights without requiring persistent storage of the entire dataset. This work introduces a scalable, model-free approach to selecting diverse and relevant counterfactual examples directly from observed data. Our algorithm operates efficiently in streaming settings, maintaining $O(\log k)$ update complexity per item while ensuring high-quality counterfactual selection. Empirical evaluations on both real-world and synthetic datasets demonstrate superior performance over baseline methods, with robust behavior even under adversarial conditions.
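To make the streaming setting concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a simplified setup: relevance is the negative Euclidean distance to the query, diversity is enforced only through a per-label cap, and a global budget $k$ bounds the total selection. The names `StreamingSelector`, `offer`, and `selection` are hypothetical. The sketch keeps one small min-heap per label, so it never stores the stream, but it does not use the matroid machinery or reproduce the $O(\log k)$ update bound (the global eviction step scans the labels).

```python
import heapq
import itertools
import math
from collections import defaultdict


class StreamingSelector:
    """One-pass selection of observed items close to a query.

    Simplifying assumptions (not taken from the paper):
    - score = negative Euclidean distance to the query;
    - at most `per_label_cap` items per label, at most `k` items overall.
    """

    def __init__(self, query, k, per_label_cap):
        self.query = query
        self.k = k
        self.cap = per_label_cap
        # label -> min-heap of (score, tiebreak, item); heap[0] is the weakest
        self.heaps = defaultdict(list)
        self.size = 0
        self._tie = itertools.count()

    def _score(self, item):
        return -math.dist(self.query, item)

    def offer(self, item, label):
        """Process one streamed item; keep it only if it improves the selection."""
        entry = (self._score(item), next(self._tie), item)
        heap = self.heaps[label]
        if len(heap) >= self.cap:
            # Label is full: swap out its weakest member if the new item is better.
            if entry[0] > heap[0][0]:
                heapq.heapreplace(heap, entry)
            return
        if self.size < self.k:
            heapq.heappush(heap, entry)
            self.size += 1
            return
        # Global budget is full: find the weakest kept item across all labels.
        weakest = min(
            (lab for lab, h in self.heaps.items() if h),
            key=lambda lab: self.heaps[lab][0][0],
        )
        if entry[0] > self.heaps[weakest][0][0]:
            heapq.heappop(self.heaps[weakest])
            heapq.heappush(heap, entry)

    def selection(self):
        """Return the kept (item, label) pairs, closest to the query first."""
        entries = [
            (score, item, label)
            for label, heap in self.heaps.items()
            for score, _, item in heap
        ]
        entries.sort(key=lambda e: e[0], reverse=True)
        return [(item, label) for _, item, label in entries]


selector = StreamingSelector(query=(0.0, 0.0), k=3, per_label_cap=2)
stream = [
    ((1.0, 0.0), "a"), ((2.0, 0.0), "a"), ((0.5, 0.0), "a"),
    ((3.0, 0.0), "b"), ((0.1, 0.0), "b"), ((5.0, 0.0), "c"),
]
for point, label in stream:
    selector.offer(point, label)
print(selector.selection())
```

Each item is inspected once and then either kept or discarded, so memory stays proportional to the budget rather than to the stream length. A richer utility that trades similarity against pairwise diversity, as the summary describes, would replace the `_score` method and the eviction rule.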
