Focus-Consistent Multi-Level Aggregation for Compositional Zero-Shot Learning
Fengyuan Dai, Siteng Huang, Min Zhang, Biao Gong, Donglin Wang
TL;DR
This work tackles compositional zero-shot learning (CZSL) by addressing two weaknesses of the common three-branch setup: every branch receives the same highest-level features, and the branches do not share spatial information. It introduces Focus-Consistent Multi-Level Aggregation (FOMA), which combines a Multi-Level Feature Aggregation (MFA) module that produces instance-specific, per-branch features from $f_1,f_2,f_3$ with a Focus-Consistent Constraint (FCC) that aligns attention across branches using Grad-CAM-style maps. The model is trained with $L = L_{cls} + \alpha L_f$, and inference sums the scores from all branches, including a CGE–GCN composition branch. Experiments on UT-Zappos, C-GQA, and Clothing16K show state-of-the-art performance in HM, AUC, and seen/unseen balance, supporting the effectiveness of per-branch feature specialization and cross-branch focus alignment for CZSL.
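The objective $L = L_{cls} + \alpha L_f$ and the summed inference score can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that $L_{cls}$ is a sum of per-branch cross-entropy terms and that branch scores are added directly for each candidate pair. The function names (`cross_entropy`, `total_loss`, `fuse_scores`) are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def total_loss(branch_logits, branch_targets, focus_loss, alpha):
    """L = L_cls + alpha * L_f.

    Assumption: L_cls is the sum of one cross-entropy term per branch
    (attribute, object, composition). focus_loss stands in for L_f.
    """
    l_cls = sum(cross_entropy(lg, t) for lg, t in zip(branch_logits, branch_targets))
    return l_cls + alpha * focus_loss

def fuse_scores(attr_scores, obj_scores, comp_scores, pairs):
    """Score each candidate (attribute, object) pair by summing branch scores.

    pairs[k] = (attribute index, object index) of the k-th composition;
    comp_scores[k] is the composition-branch score of that same pair.
    """
    return [comp_scores[k] + attr_scores[a] + obj_scores[o]
            for k, (a, o) in enumerate(pairs)]

if __name__ == "__main__":
    # Toy numbers, chosen only to exercise the functions.
    pairs = [(0, 0), (0, 1), (1, 1)]
    fused = fuse_scores([0.1, 0.9], [0.2, 0.8], [0.5, 0.1, 0.3], pairs)
    best = max(range(len(pairs)), key=lambda k: fused[k])
    print("predicted pair:", pairs[best])
```

In this toy case the third pair wins because both its attribute and object scores are high, even though its composition score is not the largest; this is the kind of complementarity that summing branch scores is meant to exploit.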
Abstract
To transfer knowledge from seen attribute-object compositions to recognize unseen ones, recent compositional zero-shot learning (CZSL) methods mainly discuss the optimal classification branches to identify the elements, leading to the popularity of the three-branch architecture. However, these methods confuse the underlying relationship among the branches in terms of consistency and diversity. Specifically, consistently providing the highest-level features to all three branches makes it harder to distinguish classes that are superficially similar. Furthermore, a single branch may focus on suboptimal regions when spatial information is not shared between the personalized branches. To address these issues, we propose a novel method called Focus-Consistent Multi-Level Aggregation (FOMA). Our method incorporates a Multi-Level Feature Aggregation (MFA) module to generate personalized features for each branch based on the image content. Additionally, a Focus-Consistent Constraint encourages a consistent focus on the informative regions, thereby implicitly exchanging spatial information between all branches. Extensive experiments on three benchmark datasets (UT-Zappos, C-GQA, and Clothing16K) demonstrate that FOMA outperforms state-of-the-art methods.
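The two ideas in the abstract, image-dependent aggregation of multi-level features and a penalty on inconsistent focus across branches, can be sketched as follows. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the softmax gate over levels, the min-max normalisation of attention maps, and the pairwise mean-squared discrepancy. All function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def aggregate_levels(level_feats, gate_logits):
    """Weighted sum of multi-level features for one branch.

    level_feats: equal-length vectors standing in for f1, f2, f3
                 (assumed already projected to a shared dimension).
    gate_logits: one scalar per level, assumed to be predicted from the
                 image by a small gating network, one set per branch.
    """
    w = softmax(gate_logits)
    dim = len(level_feats[0])
    return [sum(w[l] * level_feats[l][d] for l in range(len(level_feats)))
            for d in range(dim)]

def normalize_map(cam):
    """Min-max normalise a flattened attention map to [0, 1]."""
    lo, hi = min(cam), max(cam)
    if hi == lo:
        return [0.0] * len(cam)
    return [(v - lo) / (hi - lo) for v in cam]

def focus_discrepancy(cams):
    """Average pairwise mean-squared difference between normalised maps.

    cams: one flattened attention map per branch. A value of 0 means
    every branch attends to the same regions after normalisation.
    """
    maps = [normalize_map(c) for c in cams]
    total, n_pairs = 0.0, 0
    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            total += sum((a - b) ** 2 for a, b in zip(maps[i], maps[j])) / len(maps[i])
            n_pairs += 1
    return total / n_pairs

if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # A large gate logit on the first level pulls the result toward f1.
    print(aggregate_levels(feats, [5.0, 0.0, 0.0]))
    print(focus_discrepancy([[0.0, 1.0, 2.0], [0.0, 2.0, 4.0]]))
```

Because each branch would get its own gate logits, an attribute branch could lean on lower-level features while an object branch leans on higher-level ones, which is one way to read "personalized features for each branch".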
