TsCA: On the Semantic Consistency Alignment via Conditional Transport for Compositional Zero-Shot Learning

Miaoge Li, Jingcai Guo, Richard Yi Da Xu, Dongsheng Wang, Xiaofeng Cao, Zhijie Rao, Song Guo

TL;DR

This work reframes CZSL as a triplet distribution alignment problem among image patches ($P_1$), text prompts for compositions ($P_2$), and primitive text cues ($P_3$) using Consistency-aware Conditional Transport (CCT). A cycle-consistency constraint enforces coherent transport across the three sets, and a primitive decoupler improves state/object disentanglement, while an open-world filtering mechanism prunes infeasible state–object pairs. Training combines base classification losses with CT and regularization terms to reinforce semantic coherence across modalities. Empirical results on MIT-States, UT-Zappos, and CGQA show state-of-the-art performance in both closed- and open-world CZSL, with ablations confirming the effectiveness of CT, cycle-consistency, and decoupling components for robust compositional generalization.

Abstract

Compositional Zero-Shot Learning (CZSL) aims to recognize novel state-object compositions by leveraging the shared knowledge of their primitive components. Despite considerable progress, two challenges remain: calibrating the bias between semantically similar multimodal representations, and generalizing pre-trained knowledge to novel compositional contexts. In this paper, we revisit conditional transport (CT) theory and its homology to the visual-semantic interaction in CZSL, and propose a novel Trisets Consistency Alignment framework (dubbed TsCA) that addresses these issues. Concretely, we use three distinct yet semantically homologous sets, i.e., patches, primitives, and compositions, to construct pairwise CT costs that are minimized to reduce their semantic discrepancies. To further ensure consistent transfer within these sets, we implement a cycle-consistency constraint that refines the learning by guaranteeing the feature consistency of the self-mapping during transport flow, regardless of modality. Moreover, we extend the CT plans to an open-world setting, which enables the model to effectively filter out infeasible pairs, thereby speeding up inference and increasing accuracy. Extensive experiments are conducted to verify the effectiveness of the proposed method.
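
To make the pairwise CT cost and the cycle-consistency idea from the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes uniform marginals over each set, a cosine-distance point-to-point cost, and softmax "navigators" with a temperature `tau`; the function names `ct_cost` and `cycle_gap`, the temperature value, and the specific cycle penalty (one minus the mean self-return probability) are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def ct_cost(X, Y, tau=0.1):
    """Bidirectional CT cost between sets X (n, d) and Y (m, d).

    Assumptions (not from the source): uniform marginals, cosine cost,
    softmax transport plans with temperature tau.
    """
    X, Y = l2_normalize(X), l2_normalize(Y)
    sim = X @ Y.T                      # (n, m) cosine similarities
    cost = 1.0 - sim                   # point-to-point transport cost
    fwd = softmax(sim / tau, axis=1)   # row i: plan from x_i over points of Y
    bwd = softmax(sim / tau, axis=0)   # column j: plan from y_j over points of X
    forward_cost = (fwd * cost).sum(axis=1).mean()
    backward_cost = (bwd * cost).sum(axis=0).mean()
    return forward_cost + backward_cost, fwd, bwd

def cycle_gap(X, Y, Z, tau=0.1):
    """Hypothetical cycle penalty: chain the forward plans X -> Y -> Z -> X
    and measure how far the round trip is from returning to the start point."""
    _, xy, _ = ct_cost(X, Y, tau)
    _, yz, _ = ct_cost(Y, Z, tau)
    _, zx, _ = ct_cost(Z, X, tau)
    round_trip = xy @ yz @ zx          # (n, n) row-stochastic matrix
    return 1.0 - np.trace(round_trip) / round_trip.shape[0]
```

In this sketch, `X`, `Y`, and `Z` would stand in for the patch, composition, and primitive embedding sets. Summing `ct_cost` over the three pairs and adding `cycle_gap` as a regularizer mirrors the structure the abstract describes, though the exact losses and weights in the paper may differ.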

Paper Structure (23 sections, 19 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: We represent each image as a set of patch embeddings and two sets of textual embeddings, and employ semantic-consistency conditional transport to align this cross-modal trio of distributions.
  • Figure 2: The overall framework of the proposed TsCA (zoom-in for more details).
  • Figure 3: Qualitative results of the intra-modal transport plans on MIT-States. For each sample, we show an image with the ground-truth composition, with the state indicated in red and the object in blue. The top-3 predictions are presented in two formats: from the primitive class to the composition set in the first two columns, and from the composition label to the primitive set in the third and fourth columns. The annotations 'p', 's', 'o', and 'c' correspond to patch, state, object, and composition, respectively.
  • Figure 4: Cycle-consistency (zoom-in for more details).
  • Figure 5: Visualization of the cross-modal transport plans on MIT-States. Columns 1-3 represent the transport from the ground-truth state and object points in the primitive set, as well as the composition points in the composition set, to the patch set.