Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions
Hidde Fokkema, Tim van Erven, Sara Magliacane
TL;DR
The paper tackles interpretability and robustness in AI by marrying causal representation learning (CRL) with a principled, label-efficient alignment to human concepts. It introduces two alignment estimators—a linear Group Lasso and a kernelized, non-parametric variant—each providing formal guarantees on concept correctness and label efficiency without requiring interventions. Theoretical results establish finite-sample bounds and asymptotic consistency for the permutation alignment, while experiments on synthetic and image datasets show reduced concept impurity and competitive downstream accuracy with far fewer concept labels than standard CBMs. The work advances practical, theory-grounded concept learning, enabling more reliable and interpretable AI systems in settings with correlated concepts.
Abstract
Machine learning is a vital part of many real-world systems, but several concerns remain about the lack of interpretability, explainability and robustness of black-box AI systems. Concept Bottleneck Models (CBM) address some of these challenges by learning interpretable concepts from high-dimensional data, e.g. images, which are used to predict labels. An important issue in CBMs are spurious correlation between concepts, which effectively lead to learning "wrong" concepts. Current mitigating strategies have strong assumptions, e.g., they assume that the concepts are statistically independent of each other, or require substantial interaction in terms of both interventions and labels provided by annotators. In this paper, we describe a framework that provides theoretical guarantees on the correctness of the learned concepts and on the number of required labels, without requiring any interventions. Our framework leverages causal representation learning (CRL) methods to learn latent causal variables from high-dimensional observations in a unsupervised way, and then learns to align these variables with interpretable concepts with few concept labels. We propose a linear and a non-parametric estimator for this mapping, providing a finite-sample high probability result in the linear case and an asymptotic consistency result for the non-parametric estimator. We evaluate our framework in synthetic and image benchmarks, showing that the learned concepts have less impurities and are often more accurate than other CBMs, even in settings with strong correlations between concepts.
