UNBOX: Unveiling Black-box visual models with Natural-language

Simone Carnemolla; Chiara Russo; Simone Palazzo; Quentin Bouniot; Daniela Giordano; Zeynep Akata; Matteo Pennisi; Concetto Spampinato

TL;DR

UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints, demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.

Abstract

Ensuring trustworthiness in open-world visual recognition requires models that are interpretable, fair, and robust to distribution shifts. Yet modern vision systems are increasingly deployed as proprietary black-box APIs, exposing only output probabilities and hiding architecture, parameters, gradients, and training data. This opacity prevents meaningful auditing, bias detection, and failure analysis. Existing explanation methods assume white- or gray-box access or knowledge of the training distribution, making them unusable in these real-world settings. We introduce UNBOX, a framework for class-wise model dissection under fully data-free, gradient-free, and backpropagation-free constraints. UNBOX leverages Large Language Models and text-to-image diffusion models to recast activation maximization as a purely semantic search driven by output probabilities. The method produces human-interpretable text descriptors that maximally activate each class, revealing the concepts a model has implicitly learned, the training distribution it reflects, and potential sources of bias. We evaluate UNBOX on ImageNet-1K, Waterbirds, and CelebA through semantic fidelity tests, visual-feature correlation analyses and slice-discovery auditing. Despite operating under the strictest black-box constraints, UNBOX performs competitively with state-of-the-art white-box interpretability methods. This demonstrates that meaningful insight into a model's internal reasoning can be recovered without any internal access, enabling more trustworthy and accountable visual recognition systems.

Paper Structure (18 sections, 14 equations, 3 figures, 5 tables, 1 algorithm)

Introduction
Related work
The UNBOX Method
Semantic Optimization Mechanism
Global and Local Optimization Context
Experiments
Experimental setup
Optimization setup
Results
Class Semantic Fidelity
Latent Training Semantics Recovery
Alignment with data-derived descriptors.
Visual grounding on validation images.
Slice discovery and debiasing
Ablation
...and 3 more sections

Figures (3)

Figure 1: Overview of the UNBOX framework. Given a target class $j$ and a black-box classifier $\phi$, UNBOX reformulates activation maximization as a semantic optimization process carried out entirely in text space. The method iterates through three main components. (1) Prompt Evaluation (left): a natural-language prompt $p_t$ is rendered into an image $x_t = G(p_t)$ by a text-to-image diffusion model, which is then evaluated by the classifier to obtain the target-class score $s_t = \phi(x_t)_j$. (2) Optimization Context (center): the scalar feedback $s_t$ is transformed into a semantic guidance signal through trend $T_t$ and intensity $I_t$, which together define the textual loss $\mathcal{S}_t$. This module also maintains a Local Optimization Context $C_t$ capturing recent prompt–feedback pairs, and a Global Optimization Context $(P_{\text{best}}, D_{\text{best}})$ that accumulates high-scoring prompts and persistent lexical concepts. (3) Textual Optimization (right): two cooperating LLM agents perform prompt refinement. The Feedback Agent $A_f$ interprets the semantic signal and optimization context to generate structured feedback $e_t$, which the Updater Agent $A_u$ applies to produce the next prompt $p_{t+1}$. The loop repeats until convergence, yielding a ranked set of human-interpretable descriptors that characterize the target class under strict black-box constraints.
Figure 2: Excerpts of the system prompts for $A_f$ and $A_u$, along with the look-up table used to retrieve the text associated with $T_{s_t}$ and $I_{s_t}$ for the formulation of $\mathcal{S}_t$.
Figure 3: Qualitative examples of spurious feature attribution revealed by UNBOX. For each class, we show generated images that activate the target neuron, the corresponding GradCAM visualizations, and the highest-scoring lexical units in $D_{\text{best}}$. These examples illustrate how non-causal cues (e.g., contextual elements, co-occurring objects, or geometric patterns) can dominate the classifier’s decision process.
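
The loop described in the Figure 1 caption can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the diffusion model $G$, the classifier $\phi$, and both LLM agents are replaced by toy rule-based stand-ins, and every name below (`generate`, `classify`, `feedback_agent`, `updater_agent`, `unbox_sketch`, the vocabulary, the 0.2 intensity threshold, and the running-mean form of $D_{\text{best}}$) is a hypothetical choice made only for illustration. The sketch shows the general shape of a score-only search in text space: evaluate a prompt, turn the score change into a trend and an intensity, update the local and global contexts, and let two cooperating agents produce the next prompt.

```python
import random
from collections import defaultdict, deque

# Toy stand-ins (assumptions, not the paper's components).
VOCAB = ["water", "reeds", "duck", "lake", "forest", "road", "snow", "city"]
_HIDDEN = {"water": 0.4, "reeds": 0.3, "duck": 0.2, "lake": 0.1}  # unknown to the search


def generate(prompt):
    """Stand-in for G(p_t): the 'image' is just the set of prompt words."""
    return frozenset(prompt.split())


def classify(image):
    """Stand-in for phi(x_t)_j: only this scalar score is observable."""
    return min(1.0, sum(_HIDDEN.get(w, 0.0) for w in image))


def trend_and_intensity(score, prev_score):
    """Turn the scalar feedback into coarse signals standing in for T_t and I_t."""
    delta = score - prev_score
    trend = "improved" if delta > 0 else "worsened" if delta < 0 else "unchanged"
    intensity = "strong" if abs(delta) >= 0.2 else "mild"  # threshold is arbitrary
    return trend, intensity


def feedback_agent(trend, intensity, local_ctx, p_best, d_best, rng):
    """Rule-based stand-in for A_f: emits a structured edit e_t."""
    prompt, _ = local_ctx[-1]
    # After a drop in score, restart from the best prompt seen so far.
    base = p_best[0][0] if trend == "worsened" else prompt
    words = base.split()
    # Prefer words not tried in the recent window (local context).
    recent = {w for p, _ in local_ctx for w in p.split()}
    fresh = [w for w in VOCAB if w not in recent and w not in words]
    candidates = fresh or [w for w in VOCAB if w not in words]
    add = rng.choice(candidates) if candidates else None
    # Once the prompt is long, drop its historically weakest word.
    drop = min(words, key=lambda w: d_best.get(w, 0.0)) if len(words) >= 4 else None
    return {"base": base, "add": add, "drop": drop, "signal": f"{intensity} {trend}"}


def updater_agent(edit):
    """Rule-based stand-in for A_u: applies e_t to produce p_{t+1}."""
    words = [w for w in edit["base"].split() if w != edit["drop"]]
    if edit["add"]:
        words.append(edit["add"])
    return " ".join(words)


def unbox_sketch(initial_prompt, steps=30, top_k=3, window=4, seed=0):
    rng = random.Random(seed)
    local_ctx = deque(maxlen=window)  # C_t: recent (prompt, score) pairs
    p_best = []                       # P_best: top-k (prompt, score) pairs
    d_best = defaultdict(float)       # D_best: running mean score per word
    counts = defaultdict(int)
    prompt, prev_score = initial_prompt, 0.0
    for _ in range(steps):
        score = classify(generate(prompt))
        trend, intensity = trend_and_intensity(score, prev_score)
        local_ctx.append((prompt, score))
        ranked = sorted(set(p_best) | {(prompt, score)}, key=lambda ps: (-ps[1], ps[0]))
        p_best = ranked[:top_k]
        for w in set(prompt.split()):
            counts[w] += 1
            d_best[w] += (score - d_best[w]) / counts[w]
        edit = feedback_agent(trend, intensity, local_ctx, p_best, d_best, rng)
        prompt, prev_score = updater_agent(edit), score
    return p_best, dict(d_best)


if __name__ == "__main__":
    best_prompts, word_scores = unbox_sketch("forest")
    for p, s in best_prompts:
        print(f"{s:.2f}  {p}")
    print(sorted(word_scores.items(), key=lambda kv: -kv[1])[:4])
```

In this toy setting the search never reads `_HIDDEN` directly; it only sees the scores returned by `classify`, which mirrors the black-box constraint. In a real setup the two agents would be LLM calls and `generate` a text-to-image model, so the rule-based edits here should be read as placeholders for the structured feedback $e_t$.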
