Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories
Liviu Nicolae Fircă, Antonio Bărbălau, Dan Oneata, Elena Burceanu
TL;DR
The paper asks whether attribute knowledge can generalize across unrelated categories, a setting in which standard benchmarks allow semantic leakage between training and test sets. It proposes leakage-controlled train-test splits built from four grouping strategies (LLM-based, Embedding Similarity, Embedding Clustering, and GT Supercategories) and evaluates attribute prediction with linear probes on the McRae×THINGS dataset (1,854 concepts, 211 attributes) across vision and vision-language embeddings. Attribute prediction performance declines as the correlation between training and test sets decreases. Embedding Clustering offers the best trade-off, controlling leakage while preserving learnability, whereas GT-based splits minimize leakage but hurt performance. These findings inform the design of fairer benchmarks for attribute reasoning and reveal limitations of current representations in cross-category generalization.
Abstract
Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation of the robustness of attribute prediction under such conditions, testing whether models can correctly infer attributes shared by unrelated object types, for example identifying that the attribute "has four legs" is common to both "dogs" and "chairs". To enable this evaluation, we introduce train-test split strategies that progressively reduce the correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
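The split strategies named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, the group labels, and the 0.9 threshold are all hypothetical, chosen only to show the two ideas of holding out whole groups (clusters or supercategories) and of dropping training concepts that are too similar to any test concept.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_holdout_split(
    groups: Dict[str, str], test_groups: Set[str]
) -> Tuple[List[str], List[str]]:
    """Hold out entire groups for testing so no group spans both sets.

    `groups` maps each concept to a group id, which could come from
    clustering or from ground-truth supercategories (assumed precomputed).
    """
    train = sorted(c for c, g in groups.items() if g not in test_groups)
    test = sorted(c for c, g in groups.items() if g in test_groups)
    return train, test


def drop_similar_train(
    train: List[str],
    test: List[str],
    emb: Dict[str, Sequence[float]],
    threshold: float,
) -> List[str]:
    """Keep only training concepts whose cosine similarity to every
    test concept is at or below `threshold`."""
    return [
        c for c in train
        if all(cosine(emb[c], emb[t]) <= threshold for t in test)
    ]


# Hypothetical toy data: two "animal-like" and two "furniture-like" directions.
toy_emb = {
    "dog": [1.0, 0.1],
    "cat": [0.9, 0.2],
    "wolf": [1.0, 0.0],
    "chair": [0.1, 1.0],
    "table": [0.0, 1.0],
}
toy_groups = {
    "dog": "animal",
    "cat": "animal",
    "wolf": "animal",
    "chair": "furniture",
    "table": "furniture",
}

# Strategy 1: hold out the whole "furniture" group.
train_g, test_g = group_holdout_split(toy_groups, {"furniture"})

# Strategy 2: test on "wolf" and drop near neighbours from training.
train_s = drop_similar_train(
    ["cat", "chair", "dog", "table"], ["wolf"], toy_emb, threshold=0.9
)
```

In this toy setting the group holdout trains on the animal concepts and tests on the furniture concepts, while the similarity filter removes "dog" and "cat" from training because both lie close to the test concept "wolf". A lower threshold removes more training concepts, which reduces leakage but also leaves less data for fitting a probe, mirroring the trade-off the abstract describes.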
