FlowComposer: Composable Flows for Compositional Zero-Shot Learning

Zhenqi He, Lin Li, Long Chen

Abstract

Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.
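
As a rough illustration of what "transporting visual features toward text embeddings" can mean, the sketch below uses the common straight-line flow matching path. The abstract does not state which path the paper uses, so the linear path is an assumption. The names `linear_flow_pair` and `euler_transport` are hypothetical and not taken from the paper.

```python
import numpy as np

def linear_flow_pair(x_visual, x_text, t):
    """Point on the straight path at time t and its regression target.

    Assumption: a linear path x_t = (1 - t) * x_visual + t * x_text, whose
    velocity is constant and equal to x_text - x_visual.
    """
    x_t = (1.0 - t) * x_visual + t * x_text
    v_target = x_text - x_visual
    return x_t, v_target

def euler_transport(x_visual, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x_visual, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x
```

In training, a velocity network would be regressed onto `v_target` at randomly sampled `t`. At inference, `euler_transport` moves a visual feature along the learned field toward the text embedding space.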

Paper Structure

This paper contains 32 sections, 7 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: (a) Humans recognize new concepts by recombining familiar primitives. (b) Prior CZSL methods compose only at the token level, which may not yield valid unseen compositions in the embedding space. (c) We perform explicit composition in the embedding space via learned attribute and object flows.
  • Figure 2: Training dynamics and performance comparison with the Troika baseline (Huang et al., 2024). Our method yields a more balanced seen/unseen accuracy trajectory and consistently improves HM and AUC over the baseline on all three datasets.
  • Figure 3: Overall framework of FlowComposer. (a) Primitive flows: two flow models learn time-conditioned velocities that transport primitive visual embeddings to their corresponding text embeddings. (b) Composer: a network that predicts the combination coefficients, supervised by least-squares targets. (c) Leakage-Guided Augmentation: to exploit residual cross-branch cues, each primitive flow is also trained to transport leaked features from the counterpart (or composition) branch to its own text target.
  • Figure 4: Visual comparisons between CSP (Nayak et al.), Troika (Huang et al., 2024), and our FlowComposer on three datasets. Red marks incorrect predictions and green marks correct predictions.
  • Figure 5: The purple bar (#6B4C9A) denotes the value of $\hat{a}$ and the green bar (#68873A) denotes the value of $\hat{b}$, the coefficients for the attribute and object velocities, respectively.
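
The captions of Figures 3 and 5 describe combination coefficients $\hat{a}$ and $\hat{b}$ supervised by least-squares targets. One plausible reading is sketched below: given an attribute velocity, an object velocity, and a reference composition velocity, the targets are the coefficients that best reconstruct the composition velocity in the least-squares sense. The exact target definition is not given in this excerpt, so this formulation is an assumption. The names `least_squares_coefficients` and `compose_velocity` are hypothetical.

```python
import numpy as np

def least_squares_coefficients(v_attr, v_obj, v_comp):
    """Solve min over (a, b) of ||a * v_attr + b * v_obj - v_comp||^2.

    Assumption: all three velocities are 1-D arrays of the same length.
    """
    basis = np.stack([v_attr, v_obj], axis=1)  # shape (d, 2)
    coeffs, *_ = np.linalg.lstsq(basis, v_comp, rcond=None)
    return float(coeffs[0]), float(coeffs[1])

def compose_velocity(v_attr, v_obj, a_hat, b_hat):
    """Fuse the two primitive velocities into one composition velocity."""
    return a_hat * v_attr + b_hat * v_obj
```

Under this reading, a Composer network would be trained to predict `(a_hat, b_hat)` close to these closed-form targets, and `compose_velocity` would then give the explicit composition in the embedding space.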