
Visual Superordinate Abstraction for Robust Concept Learning

Qi Zheng, Chaoyue Wang, Dadong Wang, Dacheng Tao

TL;DR

This work tackles the fragility of concept learners under perturbations and out-of-distribution compositions in vision–language tasks by introducing Visual Superordinate Abstraction. It jointly learns a linguistic hierarchy from natural VQA data and constructs semantic-aware visual subspaces (visual superordinates) to isolate attributes, complemented by quasi-center clustering and superordinate shortcut learning to boost discrimination and curb spurious causal effects. Empirical results on CLEVR, CLEVR-CoGenT, and CLEVR-Perturb show competitive regular reasoning performance and substantial gains in robustness to perturbations and new compositions, with notable improvements in counting and bias mitigation. The framework advances robust concept learning by aligning language-driven structure with visual representations, enabling better generalization in downstream vision–language reasoning tasks.

Abstract

Concept learning constructs visual representations that are connected to linguistic semantics, which is fundamental to vision-language tasks. Although promising progress has been made, existing concept learners are still vulnerable to attribute perturbations and out-of-distribution compositions during inference. We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts, e.g. {red, blue, ...} $\in$ 'color' subspace, whereas cube $\in$ 'shape'. In this paper, we propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces (i.e. visual superordinates). With only natural visual question answering data, our model first acquires the semantic hierarchy from a linguistic view, and then explores mutually exclusive visual superordinates under the guidance of the linguistic hierarchy. In addition, a quasi-center visual concept clustering scheme and a superordinate shortcut learning scheme are proposed to enhance the discrimination and independence of concepts within each visual superordinate. Experiments demonstrate the superiority of the proposed framework under diverse settings: it increases the overall answering accuracy by a relative 7.5% on reasoning with perturbations and 15.6% on compositional generalization tests.

Paper Structure

This paper contains 18 sections, 8 equations, 11 figures, 3 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Illustration of color perturbations. The left image shares consistent attributes with the training samples, and the right one is synthesized by adding perturbations to the color attribute. We present the scores for judging metal and sphere by NSCL [mao2019neuro] (blue bars) and by our method (red bars). The colored rectangles refer to the objects in the corresponding bounding boxes.
  • Figure 2: An overview of the learning process of the proposed Visual Superordinate Abstraction framework. (a) Originally, there is a set of visual concepts and detected objects from visual question answering data. (b) After a simple curriculum (C-I), the learner acquires linguistic abstraction from the weak supervision. It categorizes concepts (e.g. 'red', 'circle', 'small') into different linguistic hierarchies (e.g. color, shape and size). (c) In the subsequent, more difficult curriculum (C-II), the learner constructs semantic-aware subspaces (i.e. visual superordinates $X$, $Y$ and $Z$) that align with the linguistic hierarchy. In each subspace, it extracts only one desired visual feature from the detected objects, which can be described with the closest concept. To form ideal visual superordinates, quasi-center visual concept clustering (Figure 3) and superordinate shortcut learning (Figure 4) are proposed.
  • Figure 3: Demonstration of quasi-center visual concept clustering in the color superordinate. $\otimes$ denotes the projection of a concept word in the visual subspace, and $+$ denotes the quasi-center of a visual concept cluster (i.e. $\mathbf{u}_m$). Quasi-center concept clustering not only gathers the visual representations around a concept, but also reduces the distance between concept words (e.g. $\otimes$ for "red") and the corresponding cluster centers (e.g. $+$). View the color version.
  • Figure 4: Demonstration of superordinate shortcut learning. To measure whether there is a spurious causal effect between $X$ and $Y$, the learner explicitly learns a shortcut $h$ by inferring $Y$ from $X$. ${//}$ is the stop-gradient operator, which indicates that no gradient propagates through $X$ when training $h$.
  • Figure 5: Illustration of biased data. Image A shows an example from CLEVR CoGenT val-A split, and image B is from the val-B split. The bar graph shows the learned correlation of $p(Y|X)$ from the biased training data.
  • ...and 6 more figures
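
The two effects described in the Figure 3 caption (gathering visual features around a quasi-center $\mathbf{u}_m$, and pulling the projected concept word toward that center) can be illustrated with a small sketch. This is not the paper's loss: the squared-distance form, the use of the per-concept mean as the quasi-center, and the names `quasi_centers`, `clustering_loss`, and `concept_proj` are all assumptions made for illustration.

```python
def quasi_centers(features, labels):
    """Mean feature vector per concept label (assumed stand-in for u_m)."""
    sums, counts = {}, {}
    for x, m in zip(features, labels):
        acc = sums.setdefault(m, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[m] = counts.get(m, 0) + 1
    return {m: [v / counts[m] for v in acc] for m, acc in sums.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def clustering_loss(features, labels, concept_proj):
    """Hypothetical two-term objective inspired by the Figure 3 caption.

    compact: average distance of each visual feature to its quasi-center.
    align:   average distance of each projected concept word to its quasi-center.
    """
    centers = quasi_centers(features, labels)
    compact = sum(sq_dist(x, centers[m]) for x, m in zip(features, labels)) / len(features)
    align = sum(sq_dist(concept_proj[m], c) for m, c in centers.items()) / len(centers)
    return compact + align, centers


if __name__ == "__main__":
    # Toy 2-D "color" subspace with two concepts (made-up numbers).
    feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
    labs = ["red", "red", "blue", "blue"]
    words = {"red": [0.9, 0.1], "blue": [0.1, 0.9]}
    loss, centers = clustering_loss(feats, labs, words)
    print(loss, centers)
```

Minimizing the first term tightens each cluster, while minimizing the second moves the $\otimes$ markers toward the $+$ markers. A contrastive formulation would be an equally plausible reading of the caption.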