Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning

Fengyuan Dai, Siteng Huang, Min Zhang, Biao Gong, Donglin Wang

TL;DR

This work tackles compositional zero-shot learning by addressing inconsistency and lack of cross-branch spatial sharing in the common three-branch setup. It introduces Focus-Consistent Multi-Level Aggregation (FOMA), combining a Multi-Level Feature Aggregation (MFA) module that produces instance-specific, per-branch features from $f_1,f_2,f_3$ and a Focus-Consistent Constraint (FCC) that aligns attention across branches using Grad-CAM–style maps. The model is trained with $L = L_{cls} + \alpha L_f$, and inference sums scores from all branches, including a CGE–GCN composition branch. Experiments on UT-Zappos, C-GQA, and Clothing16K show state-of-the-art performance across HM, AUC, and seen/unseen balance, validating the effectiveness of per-branch feature specialization and cross-branch focus alignment for CZSL.
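The objective $L = L_{cls} + \alpha L_f$ and the score-summing inference rule from the TL;DR can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: $L_{cls}$ is taken to be the sum of per-branch cross-entropies, the fused score of a pair is the plain sum of the three branch scores, and the names `cross_entropy`, `total_loss`, and `predict_composition` are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def total_loss(branch_logits, branch_targets, focus_loss, alpha):
    """L = L_cls + alpha * L_f.

    Assumption: L_cls is the sum of the cross-entropies of all branches.
    branch_logits / branch_targets are dicts keyed by branch name.
    """
    l_cls = sum(
        cross_entropy(branch_logits[name], branch_targets[name])
        for name in branch_logits
    )
    return l_cls + alpha * focus_loss


def predict_composition(attr_scores, obj_scores, comp_scores):
    """Return the (attribute, object) pair with the highest summed score.

    comp_scores maps each candidate (attr_idx, obj_idx) pair to the
    composition-branch score; the other two branches score each element.
    """
    def fused(pair):
        a, o = pair
        return comp_scores[pair] + attr_scores[a] + obj_scores[o]

    return max(comp_scores, key=fused)
```

In this sketch, a pair that the composition branch ranks second can still win once the attribute and object branches add their scores, which is the practical effect of summing scores across branches.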

Abstract

To transfer knowledge from seen attribute-object compositions to recognize unseen ones, recent compositional zero-shot learning (CZSL) methods mainly discuss the optimal classification branches to identify the elements, leading to the popularity of employing a three-branch architecture. However, these methods mix up the underlying relationship among the branches, in the aspect of consistency and diversity. Specifically, consistently providing the highest-level features for all three branches increases the difficulty in distinguishing classes that are superficially similar. Furthermore, a single branch may focus on suboptimal regions when spatial messages are not shared between the personalized branches. Recognizing these issues and endeavoring to address them, we propose a novel method called Focus-Consistent Multi-Level Aggregation (FOMA). Our method incorporates a Multi-Level Feature Aggregation (MFA) module to generate personalized features for each branch based on the image content. Additionally, a Focus-Consistent Constraint encourages a consistent focus on the informative regions, thereby implicitly exchanging spatial information between all branches. Extensive experiments on three benchmark datasets (UT-Zappos, C-GQA, and Clothing16K) demonstrate that our FOMA outperforms SOTA.

Paper Structure (18 sections, 12 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Conceptual comparison in terms of consistency and diversity. Features of the image are fed into the model, which consists of three classification branches. The heatmap is generated from the classification process of each branch. On the left, we showcase low-level, middle-level, and high-level features of the image; the lower-level features contain more intricate details. On the right, the heatmaps show that constraining the branches to focus on the same area mitigates the risk of being misled.
  • Figure 2: The architecture of FOMA consists mainly of three branches, together with the MFA module and the Focus-Consistent Constraint. The image is first fed into the backbone to extract features. The MFA module then takes both the original image and the multi-level features and outputs customized features. After attention pooling, the composition branch computes the similarity between visual and semantic embeddings, while the other two branches use classifiers to predict labels. The Focus-Consistent Constraint supervises the classification process across the three branches, as illustrated in detail in Figure 3.
  • Figure 3: Focus-Consistent Constraint. Attention pooling (AP) reduces the spatial dimensions of the features from MFA. The visual embeddings $\widetilde{f}$ help determine the score $S$ for each branch. Taking the attention map $M$ in each branch as the gradient of $S$ with respect to $f^\prime$, we expect the focused region of the composition branch to equal the sum of the focused regions of the other two branches.
  • Figure 4: Analysis of the features used from ResNet-18. A value of 4 indicates that features from all levels are used, while 2 and 1 correspond to the last two layers' features and the deepest layer's features only, respectively. Our experiments show that using the last three levels of features from ResNet-18 yields the highest performance.
  • Figure 5: Qualitative results. We present the top-3 predictions on the three datasets. The first six columns display examples that our model predicted correctly, while the last two columns showcase errors produced by the two incomplete models. Correct and incorrect results are marked with green and red borders, respectively.
  • ...and 3 more figures
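
The constraint described in the Figure 3 caption, that the composition branch's attention map should equal the sum of the attribute and object maps, can be sketched as a simple penalty. This is only an illustration under stated assumptions: the maps are taken as already computed (for example as gradients of each branch score with respect to its features), the mean-squared-error form is an assumption since the caption only states the equality target, and the name `focus_consistency_loss` is hypothetical.

```python
def focus_consistency_loss(m_comp, m_attr, m_obj):
    """Mean squared gap between M_comp and (M_attr + M_obj).

    Each argument is an H x W attention map given as a list of rows.
    Assumption: squared error averaged over all spatial positions.
    """
    total, count = 0.0, 0
    for row_c, row_a, row_o in zip(m_comp, m_attr, m_obj):
        for c, a, o in zip(row_c, row_a, row_o):
            total += (c - (a + o)) ** 2
            count += 1
    return total / count


# Toy 2x2 maps: the attribute branch attends to the top-left cell,
# the object branch to the bottom-right cell.
m_attr = [[1.0, 0.0], [0.0, 0.0]]
m_obj = [[0.0, 0.0], [0.0, 2.0]]
aligned = [[1.0, 0.0], [0.0, 2.0]]      # equals m_attr + m_obj
misaligned = [[0.0, 0.0], [0.0, 2.0]]   # ignores the attribute region

loss_aligned = focus_consistency_loss(aligned, m_attr, m_obj)
loss_misaligned = focus_consistency_loss(misaligned, m_attr, m_obj)
```

The penalty is zero when the composition map covers both regions and grows when it misses a region that another branch relies on, which is how a term of this kind would let the branches exchange spatial information implicitly.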