Taxonomy-Aware Evaluation of Vision-Language Models

Vésteinn Snæbjarnarson, Kevin Du, Niklas Stoehr, Serge Belongie, Ryan Cotterell, Nico Lang, Stella Frank

TL;DR

This work addresses the challenge of evaluating vision–language models when outputs are unconstrained by a fixed label space, by introducing taxonomy-aware evaluation that maps predictions onto taxonomies and computes hierarchical precision ($\mathrm{hP}$) and hierarchical recall ($\mathrm{hR}$), culminating in a hierarchical F1 ($\mathrm{hF}$). It develops a mapping pipeline that combines lexical and CLIP-based similarity to assign predictions to nodes in Wikidata- and Catalogue of Life–derived taxonomies and demonstrates that these taxonomy-aware metrics reveal model behaviors (e.g., prompt sensitivity and specificity) that traditional accuracy metrics miss. The authors construct and leverage taxonomically linked FGVC datasets (iNaturalist21 and OVEN) and show that ranking of VLMs changes under taxonomy-aware evaluation, with practical implications for prompt design and high-precision applications. The study also provides a detailed methodology for taxonomy extraction, mapping quality assessment, and visualization of the correlations between textual similarity measures and taxonomic proximity, highlighting the value of structure-aware evaluation for open-ended VLM outputs.

Abstract

When a vision-language model (VLM) is prompted to identify an entity depicted in an image, it may answer 'I see a conifer,' rather than the specific label 'norway spruce'. This raises two issues for evaluation: First, the unconstrained generated text needs to be mapped to the evaluation label space (i.e., 'conifer'). Second, a useful classification measure should give partial credit to less-specific, but not incorrect, answers ('norway spruce' being a type of 'conifer'). To meet these requirements, we propose a framework for evaluating unconstrained text predictions, such as those generated from a vision-language model, against a taxonomy. Specifically, we propose the use of hierarchical precision and recall measures to assess the level of correctness and specificity of predictions with regard to a taxonomy. Experimentally, we first show that existing text similarity measures do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the generated text and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our proposed taxonomic evaluation scheme.
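
The partial-credit idea above can be illustrated with a minimal sketch. It assumes the common ancestor-set formulation of hierarchical precision and recall: hP is the share of the prediction's ancestor set that overlaps the gold label's ancestor set, hR is the share of the gold label's ancestor set that is covered, and hF is their harmonic mean. The toy taxonomy, the `PARENT` table, the function names, and the choice to count a node and the root as part of its own ancestor set are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy taxonomy stored as child -> parent links (the root maps to None).
PARENT: Dict[str, Optional[str]] = {
    "plant": None,
    "conifer": "plant",
    "spruce": "conifer",
    "norway spruce": "spruce",
    "pine": "conifer",
    "scots pine": "pine",
}


def ancestors(node: str) -> Set[str]:
    """Return the node together with all of its ancestors up to the root.

    Assumption: the node itself and the root both count as members of the set.
    """
    found: Set[str] = set()
    current: Optional[str] = node
    while current is not None:
        found.add(current)
        current = PARENT[current]
    return found


def hierarchical_scores(pred: str, gold: str) -> Tuple[float, float, float]:
    """Compute (hP, hR, hF) for one predicted node against one gold node."""
    a_pred, a_gold = ancestors(pred), ancestors(gold)
    overlap = len(a_pred & a_gold)
    hp = overlap / len(a_pred)  # how much of the prediction is correct
    hr = overlap / len(a_gold)  # how much of the gold path is recovered
    hf = 2 * hp * hr / (hp + hr) if hp + hr > 0 else 0.0
    return hp, hr, hf


if __name__ == "__main__":
    # A correct but less specific answer keeps full precision and loses recall.
    print(hierarchical_scores("conifer", "norway spruce"))
    # A specific but wrong answer loses both precision and recall.
    print(hierarchical_scores("scots pine", "norway spruce"))
```

Under these assumptions, 'conifer' scored against 'norway spruce' gets hP = 1.0 and hR = 0.5, whereas exact-match accuracy would score it as simply wrong.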

Paper Structure

This paper contains 31 sections, 3 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Vision-language models (VLMs) as fine-grained classifiers. VLMs generate text with varying degrees of specificity and similarity to gold-standard label classes. We tackle the problem of aligning these outputs to taxonomic classes.
  • Figure 3: Illustrative examples. Hierarchical precision ($\mathrm{hP}$) and recall ($\mathrm{hR}$) calculations on random label pairs (see the section on evaluating similarity measures). While $\mathrm{hP}$ penalizes incorrect labels, i.e., labels that are further away from the target's ancestor set, $\mathrm{hR}$ penalizes mistakes made higher up in the taxonomy.
  • Figure 4: Agreement in node placement using different similarity measures. We observe variation in node placement for all measures. Only METEOR and ROUGE, and the CLIP variations, frequently share predictions when combined with the taxonomy-mapping algorithm.
  • Figure 5: Ranking of VLMs for the iNaturalist21 (top) and OVEN (bottom) datasets. We evaluate the ranking of VLMs (vertical axis) based on different evaluation measures (horizontal axis). The best model is shown in the top row. On the left, the model names are ranked by exact match; on the right, they are ranked by the hierarchical F1 ($\mathrm{hF}$). See the ranking appendix and the ranking-values table in the supplement for exact numbers.
  • Figure 6: Prompt tuning results for a bird classifier. For High hP ($\bullet$), we target prompts that prioritize $\mathrm{hP}$. For Acc ($\blacktriangle$), we optimize for prompts that give higher binary accuracy. This demonstrates how $\mathrm{hP}$ and $\mathrm{hR}$ can help tune a VLM application.
  • ...and 2 more figures