Neural Additive Image Model: Interpretation through Interpolation
Arik Reuter, Anton Thielmann, Benjamin Saefken
TL;DR
This work addresses the challenge of interpreting image-driven predictions in multi-modal settings by introducing the Neural Additive Image Model (NAIM), which couples Neural Additive Models with Diffusion Autoencoders to yield globally interpretable image effects within an additive framework. NAIM encodes images into a semantically meaningful latent space using a Diffusion Autoencoder and models image effects with a dedicated function $f_{img}$ on the latent code $\bm{z}$, preserving additivity for global interpretability. The authors validate the approach on synthetic data, showing accurate recovery of both numerical and image effects, and apply it to Airbnb pricing with host images, achieving higher $R^2$ than baselines and enabling both global and local interpretability through latent-space interpolation and attribute manipulation. This approach enables transparent analysis of image contributions in high-stakes domains and offers a practical path toward bias detection and fairness in multi-modal predictive systems.
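The additive structure and latent-space interpolation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `f_num`, `f_img`, `predict`, and `interpolate_effect`, the 8-dimensional latent code, the toy weights `W`, and the straight-line interpolation path are all assumptions made for illustration. In NAIM the latent code $\bm{z}$ would come from a Diffusion Autoencoder and the shape functions would be learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for learned components (not the authors' networks).
W = rng.normal(size=8)  # toy weights acting on an assumed 8-dim latent code


def f_num(x):
    """Toy shape function for a single numerical feature."""
    return np.sin(x)


def f_img(z):
    """Toy image effect on a latent code z with shape (..., 8)."""
    return np.tanh(z @ W)


def predict(x, z, beta0=0.5):
    """Additive prediction: intercept + numerical effect + image effect."""
    return beta0 + f_num(x) + f_img(z)


def interpolate_effect(z_a, z_b, steps=5):
    """Image effect along a straight line from z_a to z_b in latent space."""
    t = np.linspace(0.0, 1.0, steps)[:, None]  # column of interpolation weights
    z_path = (1.0 - t) * z_a + t * z_b         # shape (steps, 8)
    return f_img(z_path)                       # shape (steps,)


z_a, z_b = rng.normal(size=8), rng.normal(size=8)
effects = interpolate_effect(z_a, z_b)
```

Because the model is additive, the image contribution `f_img(z)` can be read off separately from the numerical contribution, so tracing `effects` along the path shows how the prediction would change as one image is morphed into another while all tabular features stay fixed.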
Abstract
Understanding how images influence the world, interpreting the effects of their semantics on various quantities, and exploring the reasons behind changes in image-based predictions are difficult yet interesting problems. By combining Neural Additive Models with Diffusion Autoencoders in a single holistic model, we can identify the latent semantics of image effects while keeping the additional tabular effects fully intelligible. The approach is highly flexible and allows the impact of various image characteristics to be explored in depth. In an ablation study, we demonstrate that the proposed method can accurately identify complex image effects. To showcase the practical applicability of the model, we conduct a case study investigating how the features and attributes captured in host images influence the pricing of Airbnb rentals.
