Leveraging Taxonomy and LLMs for Improved Multimodal Hierarchical Classification
Shijing Chen, Mohamed Reda Bouadjenek, Shoaib Jameel, Usman Naseem, Basem Suleiman, Flora D. Salim, Hakim Hacid, Imran Razzak
TL;DR
The paper tackles multi-level hierarchical classification (MLHC) over deep taxonomies, where flat or locally trained classifiers can produce predictions that are inconsistent across levels. It introduces a taxonomy-based transitional classifier (TTC) that is backbone-agnostic and uses a transition matrix $M^{[\ell_i,\ell_{i+1}]}$ to enforce hierarchical consistency between adjacent levels. Per-level logits are computed as $z^{[\ell_i]} = W^{[\ell_i]} a + b^{[\ell_i]}$, and the prediction at level $\ell_i$ is carried to the next level as $m^{[\ell_{i+1}]} = \hat{y}^{[\ell_i]} \times M^{[\ell_i,\ell_{i+1}]}$. Empirically, TTC improves consistency, exact match, and $\ell_3$ accuracy across diverse multimodal LLM backbones on the MEP-3M dataset, though HF1-Score declines slightly in some cases. These results demonstrate the practicality of a model-agnostic, taxonomy-aware classifier for multimodal MLHC and suggest broader applicability to hierarchical and standard classification tasks.
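The two formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: $a$ is treated as a feature vector shared by all levels, $\hat{y}^{[\ell_i]}$ is taken to be the one-hot vector of the predicted class, $M^{[\ell_i,\ell_{i+1}]}$ is a binary parent-to-child matrix, and $m^{[\ell_{i+1}]}$ is applied as a hard mask on the next level's logits. The function name `ttc_predict` and the toy two-level taxonomy are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    e = np.exp(z - z.max())
    return e / e.sum()


def ttc_predict(a, weights, biases, transitions):
    """Predict one class per level, restricting each level to the children
    of the class predicted at the level above (assumed masking scheme).

    a              : feature vector, shape (d,)
    weights[i]     : W^{[l_i]}, shape (n_i, d)
    biases[i]      : b^{[l_i]}, shape (n_i,)
    transitions[i] : M^{[l_i, l_{i+1}]}, binary, shape (n_i, n_{i+1})
    """
    preds = []
    mask = None
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b                        # z = W a + b
        if mask is not None:
            # Assumption: classes outside the mask are ruled out entirely.
            z = np.where(mask > 0, z, -np.inf)
        k = int(np.argmax(softmax(z)))
        preds.append(k)
        if i < len(transitions):
            y_hat = np.zeros(len(z))
            y_hat[k] = 1.0                   # one-hot prediction (assumption)
            mask = y_hat @ transitions[i]    # m = y_hat x M
    return preds


# Toy taxonomy: 2 parent classes, 4 child classes.
# Parent 0 owns children {0, 1}; parent 1 owns children {2, 3}.
M = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

a = np.array([1.0, 0.0])
W1, b1 = np.array([[2.0, 0.0], [1.0, 0.0]]), np.zeros(2)
W2, b2 = np.array([[0.1, 0.0], [0.5, 0.0], [0.2, 0.0], [0.9, 0.0]]), np.zeros(4)

unmasked_child = int(np.argmax(W2 @ a + b2))      # ignores the taxonomy
preds = ttc_predict(a, [W1, W2], [b1, b2], [M])   # respects the taxonomy
```

In this toy case an independent level-2 head would pick child 3, which belongs to parent 1, even though level 1 predicts parent 0. With the mask applied, level 2 can only choose among the children of parent 0, so the two predictions agree with the taxonomy.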
Abstract
Multi-level Hierarchical Classification (MLHC) tackles the challenge of categorizing items within a complex, multi-layered class structure. However, traditional MLHC classifiers often rely on a backbone model with independent output layers, which tend to ignore the hierarchical relationships between classes. This oversight can lead to inconsistent predictions that violate the underlying taxonomy. Leveraging Large Language Models (LLMs), we propose a novel taxonomy-embedded, transitional, LLM-agnostic framework for multimodal classification. The core of the framework is its ability to enforce consistency across hierarchical levels. Our evaluations on the MEP-3M dataset, a multimodal e-commerce product dataset with multiple hierarchical levels, demonstrated a significant performance improvement over conventional LLM structures.
