LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies

Jia Shi, Gautam Gare, Jinjin Tian, Siqi Chai, Zhiqiu Lin, Arun Vasudevan, Di Feng, Francesco Ferroni, Shu Kong

TL;DR

This work introduces the Lowest Common Ancestor (LCA) distance as a taxonomy-based metric to benchmark Out-of-Distribution generalization, unifying evaluation across Vision Models and Vision-Language Models. By showing a strong linear relationship between in-distribution LCA distance and OOD accuracy across multiple ImageNet-OOD shifts, the authors demonstrate that semantic misprediction severity is a robust predictor of generalization. They further show that class taxonomies—WordNet or latent hierarchies derived via K-means—can be used to align supervision (soft labels) or prompts to improve OOD performance. The approach offers actionable insights, including soft-label supervision and taxonomy-informed prompting, and provides open-source code for broader adoption. Overall, LCA-on-the-Line advances understanding of how semantic structure in label space relates to robust generalization under significant distribution shifts.

Abstract

We tackle the challenge of predicting models' Out-of-Distribution (OOD) performance using in-distribution (ID) measurements without requiring OOD data. Existing evaluations with "Effective Robustness", which use ID accuracy as an indicator of OOD accuracy, encounter limitations when models are trained with diverse supervision and distributions, such as class labels (Vision Models, VMs, on ImageNet) and textual descriptions (Vision-Language Models, VLMs, on LAION). VLMs often generalize better to OOD data than VMs despite having similar or lower ID performance. To improve the prediction of models' OOD performance from ID measurements, we introduce the Lowest Common Ancestor (LCA)-on-the-Line framework. This approach revisits the established concept of LCA distance, which measures the hierarchical distance between labels and predictions within a predefined class hierarchy, such as WordNet. We assess 75 models using ImageNet as the ID dataset and five significantly shifted OOD variants, uncovering a strong linear correlation between ID LCA distance and OOD top-1 accuracy. Our method provides a compelling alternative for understanding why VLMs tend to generalize better. Additionally, we propose a technique to construct a taxonomic hierarchy on any dataset using K-means clustering, demonstrating that LCA distance is robust to the constructed taxonomic hierarchy. Moreover, we demonstrate that aligning model predictions with class taxonomies, through soft labels or prompt engineering, can enhance model generalization. Open-source code is available on our project page: https://elvishelvis.github.io/papers/lca/.
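As a rough illustration of the LCA distance described in the abstract, the sketch below measures how far apart a ground-truth label and a prediction sit in a small class tree. The hierarchy, the class names, and the choice to count edges on the path through the lowest common ancestor are all assumptions made for this example; the paper's exact scoring over WordNet may differ.

```python
# Toy hierarchy stored as child -> parent.
# The class names are illustrative and are NOT taken from WordNet.
PARENT = {
    "animal": None,
    "dog": "animal",
    "cat": "animal",
    "husky": "dog",
    "beagle": "dog",
    "tabby": "cat",
}


def ancestors(node):
    """Return the path from `node` up to the root, starting with `node` itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def lowest_common_ancestor(a, b):
    """Return the deepest node that is an ancestor of both `a` and `b`."""
    seen = set(ancestors(a))
    # Walking upward from `b`, the first shared node is the lowest one.
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes do not share a root")


def lca_distance(truth, pred):
    """Count edges on the tree path truth -> LCA -> pred (0 if pred == truth).

    Assumption: path length is used here as the distance; other scorings
    (for example, weighting by depth) are possible.
    """
    lca = lowest_common_ancestor(truth, pred)
    return ancestors(truth).index(lca) + ancestors(pred).index(lca)


def mean_lca_distance(truths, preds):
    """Average LCA distance over paired ground-truth labels and predictions."""
    return sum(lca_distance(t, p) for t, p in zip(truths, preds)) / len(truths)
```

Under this toy definition, mistaking a husky for a beagle (siblings under "dog") scores lower than mistaking it for a tabby, which is the sense in which the metric captures the semantic severity of a mistake.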

Paper Structure

This paper contains 38 sections, 12 equations, 9 figures, 15 tables, 1 algorithm.

Figures (9)

  • Figure 1: Correlation between LCA distance and out-of-distribution (OOD) performance in Vision and Vision-Language Models (VLMs). In both panels, the X-axis represents the top-1 accuracy on ObjectNet (OOD test dataset). The Y-axes depict the top-1 accuracy (left-axis) and LCA distance (right-axis) on ImageNet (ID test dataset). The left plot reveals a divergent trend where Vision Models (VMs) show a trade-off between OOD and ID accuracy, while VLMs tend to maintain higher OOD accuracy regardless of ID performance. The right plot demonstrates a unified, strong positive correlation between LCA distance and OOD accuracy for both VMs and VLMs, showing that LCA distance is a robust metric for evaluating model generalization across different architectures, model modalities, and training data sources.
  • Figure 2: Comparison of our setting with prior work. Left: prior work settings such as Accuracy-on-the-line (Miller et al., 2021) and Agreement-on-the-line (Baek et al., 2022). Right: our setting. To the best of our knowledge, LCA-on-the-line is the first approach to uniformly measure model robustness across VMs and VLMs on OOD datasets with significant distribution shifts (ImageNet-S/R/A/O).
  • Figure 3: LCA distance visualization. Our method estimates a model's generalization based on its in-distribution semantic severity of mistakes. We use the 'Lowest Common Ancestor' (LCA) distance to rank the distance between the model's prediction and the ground-truth class within a predefined taxonomic hierarchy, such as WordNet. The LCA distance is proportional to the shortest path from the prediction to the ground-truth class in the hierarchy.
  • Figure 4: Capturing transferable features for model generalization. ImageNet-R maintains shape information (Geirhos et al., 2018) such as 'long neck', 'big belly', and 'long legs'. We hypothesize that models with good generalization should capture these transferable features rather than succumbing to spurious correlations such as 'grass', thereby tending to predict classes that are semantically closer to the ground-truth. Such models are expected to have low LCA distances between their predictions and the ground-truth.
  • Figure 5: Correlation with OOD Top-1/Top-5 accuracy (VMs and VLMs, 75 models) on 4 ImageNet-OOD datasets, visualizing Table \ref{tab:correlation_all}. The plots show that the in-distribution LCA distance correlates more strongly with the model's OOD performance across all OOD datasets than accuracy-on-the-line (Miller et al., 2021). Each plot's x-axis represents the OOD dataset metric (OOD Top-1 accuracy in the top row, OOD Top-5 accuracy in the bottom row), and the y-axes represent ImageNet ID test Top-1 accuracy (left) and LCA distance (right). The red line (pink dots: VMs, red dots: VLMs) represents in-distribution classification accuracy (Top-1); the green line (green dots: VMs, blue dots: VLMs) denotes in-distribution taxonomic distance (LCA). As interpreted in Figure \ref{fig:explain_vlm}, accuracy-on-the-line only explains generalization of models within similar settings (VMs or VLMs), but does not unify both settings.
  • ...and 4 more figures
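The comparison drawn in the figure captions above (does ID accuracy or ID LCA distance track OOD accuracy more closely?) comes down to measuring the strength of a linear relationship across models. A minimal sketch, assuming Pearson correlation as the measure: the per-model numbers are hypothetical placeholders invented for this illustration, not results from the paper, and `pearson_r` is a helper defined here rather than part of the authors' code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL per-model measurements (one entry per model), made up
# purely to show the call pattern. They are not taken from the paper.
id_top1_accuracy = [0.70, 0.76, 0.80, 0.68, 0.72]
id_lca_distance = [7.9, 7.6, 7.4, 7.1, 6.9]
ood_top1_accuracy = [0.25, 0.30, 0.35, 0.45, 0.50]

r_accuracy = pearson_r(id_top1_accuracy, ood_top1_accuracy)
r_lca = pearson_r(id_lca_distance, ood_top1_accuracy)

# The indicator with the larger absolute correlation tracks OOD accuracy
# more closely on this (made-up) set of models.
print(f"ID accuracy vs OOD accuracy: r = {r_accuracy:.3f}")
print(f"ID LCA distance vs OOD accuracy: r = {r_lca:.3f}")
```

With real measurements, each list would hold one value per evaluated model, and the comparison would be repeated for each OOD dataset.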