LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification

Renyi Qu; Mark Yatskar

TL;DR

The paper addresses the interpretability of vision-language models in fine-grained image classification. It introduces Hi-CoDe, a framework that uses GPT-4 to generate a structured hierarchical concept tree and classifies with an ensemble of simple linear probes on CLIP-derived concept embeddings. Visual parts, attributes, and attribute values are combined into root-to-leaf concepts, h(y,p,a,v), which are encoded as CLIP text embeddings for classification. Each visual part gets its own classifier, and the per-part decisions are fused through voting, so one can inspect which parts and attributes drive a prediction. On six fine-grained datasets, accuracy is competitive with interpretable baselines, and ablation studies show that deeper decomposition and robust voting strategies benefit both performance and interpretability.
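The root-to-leaf construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the nested-dictionary layout, the function names, and the text template inside `h` are all assumptions, and the tree here is flattened to a single level of parts, whereas the paper's hierarchy allows parts to have subparts.

```python
from typing import Dict, List

# Hypothetical toy layout: class label -> part -> attribute -> attribute values.
ConceptTree = Dict[str, Dict[str, Dict[str, List[str]]]]


def h(y: str, p: str, a: str, v: str) -> str:
    """Join label, part, attribute, and value into one concept string.

    The template is an assumption made for illustration only.
    """
    return f"{y} {p} {a}: {v}"


def root_to_leaf_concepts(tree: ConceptTree) -> List[str]:
    """Enumerate every (label, part, attribute, value) path as a text concept."""
    concepts = []
    for label, parts in tree.items():
        for part, attrs in parts.items():
            for attr, values in attrs.items():
                for value in values:
                    concepts.append(h(label, part, attr, value))
    return concepts


toy_tree = {
    "dog": {
        "head": {"shape": ["round", "narrow"]},
        "tail": {"length": ["short"]},
    }
}
concepts = root_to_leaf_concepts(toy_tree)
```

Each resulting string would then be passed to a CLIP text encoder to obtain a concept embedding; that step is omitted here to keep the sketch dependency-free.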

Abstract

(Renyi Qu's Master's Thesis) Recent advancements in interpretable models for vision-language tasks have achieved competitive performance; however, their interpretability often suffers due to the reliance on unstructured text outputs from large language models (LLMs). This introduces randomness and compromises both transparency and reliability, which are essential for addressing safety issues in AI systems. We introduce Hi-CoDe (Hierarchical Concept Decomposition), a novel framework designed to enhance model interpretability through structured concept analysis. Our approach consists of two main components: (1) We use GPT-4 to decompose an input image into a structured hierarchy of visual concepts, thereby forming a visual concept tree. (2) We then employ an ensemble of simple linear classifiers that operate on concept-specific features derived from CLIP to perform classification. Our approach not only aligns with the performance of state-of-the-art models but also advances transparency by providing clear insights into the decision-making process and highlighting the importance of various concepts. This allows for a detailed analysis of potential failure modes and improves model compactness, thereby setting a new benchmark in interpretability without compromising accuracy.
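The second component, an ensemble of linear classifiers whose decisions are combined, can be illustrated as follows. This is a hedged sketch under stated assumptions: plain majority voting with a smallest-index tie-break is only one possible fusion rule, the weights and features are made-up numbers rather than CLIP-derived values, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def linear_predict(weights: Matrix, bias: Vector, features: Vector) -> int:
    """Return the argmax class of a linear classifier (one weight row per class)."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def majority_vote(votes: Sequence[int]) -> int:
    """Most frequent class; ties go to the smallest class index (an assumption)."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(cls for cls, n in counts.items() if n == top)


def ensemble_predict(
    classifiers: Dict[str, Tuple[Matrix, Vector]],
    part_features: Dict[str, Vector],
) -> Tuple[int, Dict[str, int]]:
    """Run one linear classifier per part and fuse the per-part votes."""
    votes = {
        part: linear_predict(w, b, part_features[part])
        for part, (w, b) in classifiers.items()
    }
    return majority_vote(list(votes.values())), votes


# Made-up weights and features for three parts and two classes.
identity = [[1.0, 0.0], [0.0, 1.0]]
classifiers = {
    "head": (identity, [0.0, 0.0]),
    "tail": (identity, [0.0, 0.0]),
    "legs": (identity, [0.5, 0.0]),
}
part_features = {"head": [0.1, 0.9], "tail": [0.2, 0.8], "legs": [0.3, 0.4]}
label, votes = ensemble_predict(classifiers, part_features)
```

Returning the per-part votes alongside the fused label is what makes the decision inspectable: a reader can see which parts agreed with the final prediction and which did not.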

Paper Structure (15 sections, 3 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Method
Concept Tree Decomposition
Concept Classification
Experiment
Datasets
Baselines
Evaluation
Ablation Studies
Comparison 1: Decomposition Depth
Comparison 2: Label Inclusion
Comparison 3: Voting Strategy
Conclusion
Inference Demo

Figures (3)

Figure 1: Our system.
Figure 2: A snippet of a Part-Attribute subtree for the "Skull" part. The "Head" node serves as the parent visual part of "Skull". The "Attrs" node displays a comprehensive subtree of Name-Value attribute pairs associated with the "Skull" part. Each score represents the contribution of its corresponding attribute value to the prediction of the "Head" classifier.
Figure 3: A snippet of the hierarchy of visual parts. The "Mouth" node originates from the "Head" parent node. Subparts of "Mouth," including "Lips," "Teeth," and "Tongue," are represented as child nodes. Each part features its own attribute subtree, independent of its level of decomposition.
