Visual Superordinate Abstraction for Robust Concept Learning
Qi Zheng, Chaoyue Wang, Dadong Wang, Dacheng Tao
TL;DR
This work tackles the fragility of concept learners under perturbations and out-of-distribution compositions in vision–language tasks by introducing Visual Superordinate Abstraction. It first learns a linguistic hierarchy from natural VQA data and then, guided by that hierarchy, constructs semantic-aware visual subspaces (visual superordinates) that isolate attributes. Quasi-center clustering and superordinate shortcut learning complement this to boost discrimination and curb spurious causal effects. Empirical results on CLEVR, CLEVR-CoGenT, and CLEVR-Perturb show competitive performance on regular reasoning and substantial gains in robustness to perturbations and new compositions, with notable improvements in counting and bias mitigation. The framework advances robust concept learning by aligning language-driven structure with visual representations, enabling better generalization in downstream vision–language reasoning tasks.
Abstract
Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners remain vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe this bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g., \{red, blue, ...\} $\in$ `color' subspace, whereas cube $\in$ `shape' subspace. In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e., visual superordinates). With only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view, and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings: it improves overall answering accuracy by a relative 7.5\% on reasoning with perturbations and by a relative 15.6\% on compositional generalization tests.
