Exploring Transferable Homogeneous Groups for Compositional Zero-Shot Learning

Zhijie Rao, Jingcai Guo, Miaoge Li, Yang Chen

TL;DR

This paper tackles the CZSL challenge of conditional dependency between states and objects by introducing Homogeneous Group Representation Learning (HGRL), which forms multiple homogeneous groups to balance transferability and discriminability. It couples a Group-Aware Visual Representation with a Decoupled Group Prompt and a Group-Aware Pair Enhancement module, all grounded on a CLIP backbone and guided by a text co-occurrence probability graph to discover group structure without explicit hierarchical labels. The approach demonstrates state-of-the-art performance across MIT-States, UT-Zappos, and C-GQA in both closed- and open-world settings, with ablation and visualization analyses showing the contributions of each component. Theoretical insights frame the method as reducing distributional divergence across domains via homogeneous grouping, suggesting strong practical impact for robust compositional reasoning and zero-shot generalization.

Abstract

Conditional dependency presents one of the trickiest problems in Compositional Zero-Shot Learning, leading to significant property variations of the same state (object) across different objects (states). To address this problem, existing approaches often adopt either all-to-one or one-to-one representation paradigms. However, these extremes create an imbalance in the seesaw between transferability and discriminability, favoring one at the expense of the other. By contrast, humans are adept at analogizing and reasoning in a hierarchical clustering manner, intuitively grouping categories with similar properties to form cohesive concepts. Motivated by this, we propose Homogeneous Group Representation Learning (HGRL), a new perspective that formulates state (object) representation learning as multiple homogeneous sub-group representation learning. HGRL seeks to achieve a balance between semantic transferability and discriminability by adaptively discovering and aggregating categories with shared properties, learning distributed group centers that retain group-specific discriminative features. Our method integrates three core components designed to simultaneously enhance both the visual and prompt representation capabilities of the model. Extensive experiments on three benchmark datasets validate the effectiveness of our method.
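
The contrast between the all-to-one, one-to-one, and grouped paradigms can be illustrated with a toy example. The sketch below is not the authors' implementation: the 2-D features, the object names, the hypothetical unseen pair "old horse", and the use of plain k-means as a stand-in for adaptive group discovery are all assumptions made for illustration only.

```python
import math

# Toy 2-D "visual features" for the state "old", one per seen object.
# All values are invented for illustration; they are not from the paper.
seen_old = {
    "dog": (0.9, 0.1),
    "cat": (1.0, 0.2),
    "car": (0.1, 0.9),
    "truck": (0.2, 1.0),
}


def mean(points):
    """Coordinate-wise mean of a non-empty collection of points."""
    pts = list(points)
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def kmeans(points, init_centers, iters=10):
    """Plain k-means, used here only as a stand-in for group discovery."""
    centers = list(init_centers)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


# All-to-one: a single prototype for "old", shared by every object.
all_to_one = mean(seen_old.values())

# One-to-one: a separate prototype per seen (state, object) pair.
one_to_one = dict(seen_old)

# Grouped: a few centers, each shared by objects with similar properties.
group_centers = kmeans(list(seen_old.values()),
                       init_centers=[seen_old["dog"], seen_old["car"]])

# Hypothetical feature of an unseen pair, "old horse".
unseen_feature = (0.95, 0.25)
err_all = math.dist(unseen_feature, all_to_one)
err_group = min(math.dist(unseen_feature, c) for c in group_centers)

print("all-to-one distance:", round(err_all, 3))
print("nearest group distance:", round(err_group, 3))
print("one-to-one has a prototype for horse:", "horse" in one_to_one)
```

In this toy setting the single shared prototype sits between the animal-like and vehicle-like features and matches neither well, the per-pair table has no entry for the unseen object, and the nearest group center lies much closer to the unseen feature. This mirrors the trade-off between transferability and discriminability described in the abstract, under the stated assumptions.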

Paper Structure (17 sections, 24 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: (a) Motivation. Humans utilize higher-order knowledge structures to perform hierarchical clustering, enabling effective analogies and reasoning. (b) Visualization of the state "old". The semantic structure of homogeneous groups is well maintained in deep feature space, e.g., animals are naturally clustered together. (c) Visualization of the object "apple". Similarly, apples in various states cluster in groups due to visual variation.
  • Figure 2: Overview of the proposed method, which comprises three main components to enhance both visual and prompt representations. Note that the two branches of state and object are symmetric, so just one branch is presented. State (object) visual enhancement: GAVR deeply explores latent homogeneous groups and utilizes multi-expert networks to extract group-specific representations to maintain semantic integrity and transferability. Pair visual enhancement: GAPE integrates state and object features for joint recognition and introduces group-aware feature augmentation to improve pair diversity. Text prompt enhancement: Unlike traditional category prompts, DGP additionally learns a customized contextual prompt for each group.
  • Figure 3: (a) The effect of group number for state branch. (b) The effect of group number for object branch. (c) The sensitivity of $\lambda$ on UT-Zappos. (d) The sensitivity of $\lambda$ on C-GQA.
  • Figure 4: (a) Attention visualization for GAVR. (b-c) t-SNE analysis for DGP. Red pentagrams indicate group prompt representations. Dots indicate image representations. (b) shows the state whose label is Synthetic in UT-Zappos. Different colored dots indicate different objects. (c) shows the object whose label is Shoes.Sneakers.and.Athletic.Shoes in UT-Zappos. Different colored dots indicate different states.