Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

Liviu Nicolae Fircă, Antonio Bărbălau, Dan Oneata, Elena Burceanu

TL;DR

The paper addresses whether attribute knowledge can generalize across unrelated categories, a setting in which standard benchmarks allow semantic leakage. It proposes leakage-controlled train-test splits with four grouping strategies (LLM-based, Embedding Similarity, Embedding Clustering, and GT Supercategories) and evaluates attribute prediction using linear probes on the McRae×THINGS dataset (1,854 concepts, 211 attributes) across vision and vision-language embeddings. The results show that attribute prediction performance declines as train-test correlation decreases; Embedding Clustering provides the best leakage control while preserving learnability, whereas GT-based splits minimize leakage but hurt performance. These findings inform the design of fairer benchmarks for attribute reasoning and reveal limitations of current representations in cross-category generalization.

Abstract

Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation of the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types, e.g., identifying that the attribute "has four legs" is common to both "dogs" and "chairs". To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
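
To make the idea of a leakage-controlled split concrete, the following is a minimal sketch of one of the strategies named above, embedding similarity thresholding, followed by a split that keeps every group entirely on one side. It is an illustration under stated assumptions, not the authors' procedure: the threshold value, the use of connected components to form groups, the helper names (`group_by_similarity`, `split_by_group`), and the three-dimensional toy vectors are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_similarity(embeddings, threshold):
    """Group concepts whose cosine similarity is >= threshold.

    Groups are the connected components of the thresholded similarity
    graph (an assumption of this sketch), found with union-find.
    """
    names = list(embeddings)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the root, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


def split_by_group(groups, test_fraction, seed=0):
    """Assign whole groups to test until the target size is reached."""
    rng = random.Random(seed)
    shuffled = list(groups)
    rng.shuffle(shuffled)
    total = sum(len(g) for g in shuffled)
    train, test = [], []
    for g in shuffled:
        # A group is never divided between train and test.
        (test if len(test) < test_fraction * total else train).extend(g)
    return train, test


# Hypothetical toy embeddings; real ones would come from a vision or
# vision-language encoder.
toy = {
    "dog": [1.0, 0.1, 0.0],
    "cat": [0.9, 0.2, 0.0],
    "chair": [0.0, 1.0, 0.1],
    "table": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 1.0],
}
groups = group_by_similarity(toy, threshold=0.9)
train, test = split_by_group(groups, test_fraction=0.4)
print("groups:", groups)
print("train:", train, "test:", test)
```

In this toy example "car" has no neighbor above the threshold and stays a singleton, which mirrors the coverage concern raised for the similarity-based grouping in Figure 1: concepts left ungrouped can still sit close to concepts on the other side of the split. A clustering-based grouping would instead assign every concept to some cluster before the same group-level split is applied.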

Paper Structure

This paper contains 9 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Granularity and coverage of concepts in the grouping methods. A. LLM-based and B. Embedding Similarity offer high precision but leave many concepts ungrouped, risking semantic leakage. While GT: Supercategory Labels and C. Embedding Clustering both ensure full coverage, the former produces overly broad groups, whereas the latter offers finer granularity, enabling more reliable and controlled train-test splits.
  • Figure 2: In Random grouping (left), a positive correlation emerges, indicating reliance on supercategory-specific features. In contrast, the split based on Embedding Clustering yields a near-zero correlation, suggesting improved generalization and reduced semantic leakage.