Interpretable Generative Models through Post-hoc Concept Bottlenecks
Akshay Kulkarni, Ge Yan, Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng
TL;DR
This work addresses the interpretability and scalability gaps in generative models by introducing two post-hoc approaches, the concept-bottleneck autoencoder (CB-AE) and the concept controller (CC), which convert a pretrained generator into an inherently interpretable model without training from scratch or relying on abundant concept labels. The CB-AE is inserted between the two halves of the generator, $g_1$ and $g_2$, which are kept fixed, and learns a concept space $c$ via an encoder $E$ and a decoder $D$. The lightweight CC provides concept predictions from $g_1(z)$ via a predictor $\Omega$. Interventions are enabled through optimization-based and cyclic losses, and a pseudo-label source $M$ (e.g., CLIP or supervised classifiers) drives concept alignment, which enables steering both during training and at test time with minimal supervision. Across GANs and diffusion models on CelebA, CelebA-HQ, and CUB, CB-AE and CC improve steerability over prior work by about 25% on average and are 4–15× faster to train, with a large-scale user study validating interpretability and steerability. The methods generalize across multiple model families and offer a practical path to scalable, post-hoc interpretable generation without expensive labeled data or training from scratch.
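The data flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the generator halves and the CB-AE components are random, untrained linear maps, the dimensions are toy values, and the intervention simply overwrites a concept value before decoding, which stands in for (and is much simpler than) the optimization-based and cyclic-loss mechanisms mentioned in the summary. The names `generate` and `concept_controller` are hypothetical and do not come from the paper.

```python
import math
import random

random.seed(0)

# Toy sizes chosen for illustration only (not from the paper).
Z_DIM, H_DIM, N_CONCEPTS, IMG_DIM = 4, 6, 3, 8


def linear(in_dim, out_dim):
    """Return a random weight matrix with out_dim rows of length in_dim."""
    return [[random.gauss(0.0, 0.5) for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply(weights, x):
    """Matrix-vector product weights @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Stand-ins for the frozen halves of a pretrained generator g = g2(g1(z)).
W_G1 = linear(Z_DIM, H_DIM)
W_G2 = linear(H_DIM, IMG_DIM)
# Stand-ins for the CB-AE encoder E and decoder D (untrained here).
W_E = linear(H_DIM, N_CONCEPTS)
W_D = linear(N_CONCEPTS, H_DIM)
# Stand-in for the CC predictor Omega (untrained here).
W_OMEGA = linear(H_DIM, N_CONCEPTS)


def g1(z):
    """First generator half: latent z -> intermediate feature."""
    return [math.tanh(v) for v in apply(W_G1, z)]


def g2(h):
    """Second generator half: intermediate feature -> flattened 'image'."""
    return apply(W_G2, h)


def generate(z, interventions=None):
    """CB-AE path: z -> g1 -> E -> c -> (optional edit) -> D -> g2."""
    c = apply(W_E, g1(z))                 # concept vector c
    for idx, value in (interventions or {}).items():
        c[idx] = value                    # simplified intervention (assumption)
    image = g2(apply(W_D, c))             # decode concepts, finish generation
    return image, c


def concept_controller(z):
    """CC path: read concepts from g1(z) without changing the generator."""
    return apply(W_OMEGA, g1(z))


z = [0.1, -0.2, 0.3, 0.4]
image, concepts = generate(z)
edited_image, edited_concepts = generate(z, interventions={1: 5.0})
```

In this sketch, editing concept index 1 changes only that entry of the concept vector, and the change reaches the output solely through `W_D` and `g2`, which is the sense in which the bottleneck mediates generation.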
Abstract
Concept bottleneck models (CBMs) aim to produce inherently interpretable models that rely on human-understandable concepts for their predictions. However, existing approaches to designing interpretable generative models based on CBMs are not yet efficient or scalable, as they require expensive training of the generative model from scratch as well as real images with labor-intensive concept supervision. To address these challenges, we present two novel, low-cost methods for building interpretable generative models through post-hoc techniques: the concept-bottleneck autoencoder (CB-AE) and the concept controller (CC). Our proposed approaches enable efficient and scalable training without the need for real data and require minimal to no concept supervision. Additionally, our methods generalize across modern generative model families, including generative adversarial networks and diffusion models. We demonstrate the superior interpretability and steerability of our methods on standard datasets such as CelebA, CelebA-HQ, and CUB, with large improvements (about 25% on average) over prior work, while being 4–15× faster to train. Finally, we perform a large-scale user study to validate the interpretability and steerability of our methods.
