Taxonomy-Aware Evaluation of Vision-Language Models
Vésteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, Nico Lang, Stella Frank
TL;DR
This work addresses the challenge of evaluating vision-language models (VLMs) whose outputs are not constrained to a fixed label space. It introduces taxonomy-aware evaluation, which maps predictions onto a taxonomy and computes hierarchical precision ($\mathrm{hP}$) and hierarchical recall ($\mathrm{hR}$), combined into a hierarchical F1 score ($\mathrm{hF}$). The mapping pipeline combines lexical and CLIP-based similarity to assign predictions to nodes in taxonomies derived from Wikidata and the Catalogue of Life. These taxonomy-aware metrics reveal model behaviors (e.g., prompt sensitivity and specificity) that traditional accuracy metrics miss. The authors construct and use taxonomically linked fine-grained visual classification (FGVC) datasets (iNaturalist21 and OVEN) and show that the ranking of VLMs changes under taxonomy-aware evaluation, with practical implications for prompt design and high-precision applications. The study also details its methodology for taxonomy extraction and mapping quality assessment, and visualizes the correlations between textual similarity measures and taxonomic proximity, highlighting the value of structure-aware evaluation for open-ended VLM outputs.
Abstract
When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer 'I see a conifer,' rather than the specific label 'norway spruce'. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., 'conifer'). Second, a useful classification measure should give partial credit to less-specific, but not incorrect, answers ('norway spruce' being a type of 'conifer'). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated from a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
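To make the idea of partial credit concrete, the following is a minimal sketch of hierarchical precision, recall, and F1 on a toy taxonomy. It assumes the common ancestor-set formulation, in which each label is represented by itself plus its ancestors (excluding the root), $\mathrm{hP}$ is the overlap divided by the size of the prediction's set, and $\mathrm{hR}$ is the overlap divided by the size of the ground truth's set; the paper's exact definition may differ in detail. The taxonomy `PARENT` and the helper names `ancestor_set` and `hierarchical_scores` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy taxonomy (child -> parent); the root has parent None.
PARENT: Dict[str, Optional[str]] = {
    "plant": None,
    "conifer": "plant",
    "spruce": "conifer",
    "norway spruce": "spruce",
    "pine": "conifer",
    "scots pine": "pine",
}


def ancestor_set(node: str, parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the node together with its ancestors, excluding the root."""
    nodes: Set[str] = set()
    current: Optional[str] = node
    # Stop once we reach the root (the only node whose parent is None).
    while current is not None and parent[current] is not None:
        nodes.add(current)
        current = parent[current]
    return nodes


def hierarchical_scores(
    pred: str, true: str, parent: Dict[str, Optional[str]] = PARENT
) -> Tuple[float, float, float]:
    """Compute (hP, hR, hF) under the assumed ancestor-set definition."""
    pred_set = ancestor_set(pred, parent)
    true_set = ancestor_set(true, parent)
    overlap = len(pred_set & true_set)
    # Empty sets arise only for the root; score those as 0 by convention.
    hp = overlap / len(pred_set) if pred_set else 0.0
    hr = overlap / len(true_set) if true_set else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr > 0 else 0.0
    return hp, hr, hf


if __name__ == "__main__":
    for guess in ("norway spruce", "conifer", "scots pine"):
        print(guess, hierarchical_scores(guess, "norway spruce"))
```

Under these assumptions, predicting 'conifer' for the true label 'norway spruce' yields $\mathrm{hP} = 1$ (nothing in the prediction is wrong) but $\mathrm{hR} = 1/3$ (the answer is not specific enough), whereas the incorrect but equally specific 'scots pine' scores $1/3$ on both. Scoring the root as zero is a convention of this sketch, not something stated in the source.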
