Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval

Haokun Wen, Xuemeng Song, Jianhua Yin, Jianlong Wu, Weili Guan, Liqiang Nie

TL;DR

This work tackles composed image retrieval by addressing the lack of multi-factor matching analysis and the underutilization of unlabeled data. It introduces LIMN, a CLIP-Transformer-based network that learns disentangled latent factor tokens, performs dual aggregation to produce a robust final matching token, and optimizes both token- and factor-level matching losses. To boost generalization, LIMN+ employs an iterative dual self-training loop that uses an image difference captioning model to generate pseudo triplets from unlabeled pairs and filters them with LIMN, achieving state-of-the-art results on FashionIQ, Shoes, CIRR, and Fashion200K. The findings demonstrate that modeling latent matching factors and leveraging unlabeled data substantially improve CIR performance, with LIMN+ offering a practical, plug-in enhancement for existing CIR models and datasets.

Abstract

The composed image retrieval (CIR) task aims to retrieve the desired target image for a given multimodal query, i.e., a reference image with its corresponding modification text. Existing efforts share two key limitations: 1) they ignore the multi-faceted query-target matching factors; 2) they ignore the potential unlabeled reference-target image pairs in existing benchmark datasets. Addressing these two limitations is non-trivial because of the following challenges: 1) how to effectively model the multi-faceted matching factors in a latent way without direct supervision signals; 2) how to fully utilize the potential unlabeled reference-target image pairs to improve the generalization ability of the CIR model. To address these challenges, we first propose a muLtI-faceted Matching Network (LIMN), which consists of three key modules: multi-grained image/text encoder, latent factor-oriented feature aggregation, and query-target matching modeling. Thereafter, we design an iterative dual self-training paradigm to further enhance the performance of LIMN by fully utilizing the potential unlabeled reference-target image pairs in a semi-supervised manner. We denote LIMN enhanced by the iterative dual self-training paradigm as LIMN+. Extensive experiments on three real-world datasets, FashionIQ, Shoes, and Birds-to-Words, show that our proposed method significantly surpasses the state-of-the-art baselines.
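
As a rough illustration of the pseudo-triplet idea mentioned in the TL;DR (a captioning model proposes modification text for an unlabeled reference-target pair, and the CIR model keeps only the triplets it scores well), the following minimal Python sketch may help. It is not the authors' implementation: the function names `make_pseudo_triplets`, `toy_caption`, and `toy_score`, the score threshold, and the toy data are all assumptions made for illustration, and the five steps of the actual paradigm are not spelled out in this summary.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]          # (reference image id, target image id)
Triplet = Tuple[str, str, str]  # (reference id, modification text, target id)


def make_pseudo_triplets(
    pairs: Sequence[Pair],
    caption_fn: Callable[[str, str], str],
    score_fn: Callable[[str, str, str], float],
    threshold: float,
) -> List[Triplet]:
    """Caption each unlabeled pair, then keep triplets the scorer accepts.

    caption_fn stands in for an image difference captioning model and
    score_fn for a trained CIR model; both are hypothetical placeholders.
    The threshold-based filter is an assumption, not the paper's rule.
    """
    kept: List[Triplet] = []
    for ref, tgt in pairs:
        text = caption_fn(ref, tgt)  # proposed modification text
        if score_fn(ref, text, tgt) >= threshold:
            kept.append((ref, text, tgt))
    return kept


# Toy stand-ins so the sketch runs without any trained model.
def toy_caption(ref: str, tgt: str) -> str:
    return f"change {ref} into {tgt}"


def toy_score(ref: str, text: str, tgt: str) -> float:
    # Accept only pairs whose images differ and whose text names the target.
    return 1.0 if tgt in text and ref != tgt else 0.0


pairs = [("red_dress", "blue_dress"), ("boot", "boot")]
pseudo = make_pseudo_triplets(pairs, toy_caption, toy_score, threshold=0.5)
```

In this toy run only the first pair survives the filter, because the second pair has identical reference and target images. In a real setting the retained triplets would be added to the training data before the next round.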

Paper Structure (27 sections, 7 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: An example of the multiple query-target matching factors between the multimodal query and the target image. Note that semantics like "color" or "belt" only convey a concept since the matching factors are implicitly contained.
  • Figure 2: Illustration of our proposed LIMN. It consists of three modules: (a) disentanglement-based latent factor tokens mining, (b) dual aggregation-based matching token learning, and (c) dual query-target matching modeling.
  • Figure 3: The proposed iterative dual self-training paradigm for boosting the performance of LIMN and deriving LIMN+, which consists of five steps.
  • Figure 4: Influence of the number of latent factor tokens $U$ on (a) FashionIQ, (b) Shoes, and (c) CIRR.
  • Figure 5: The iterative performance of our CIR model LIMN and the IDC model DUDA under the iterative dual self-training paradigm on the FashionIQ, Shoes, and CIRR datasets. CIR performance is the average of R@$k$, and IDC performance is the average of BLEU-$1$ and ROUGE-L. The initial bar in each plot shows the performance of the original model, while the iter$*$ bar denotes the $*$-th self-training iteration. Iteration stops when the performance gains become very limited (a and c) or the performance decreases (b). The bars with white slashes indicate the optimal performance reported in this work.
  • ...and 2 more figures
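
The Figure 5 caption reports CIR performance as an average of R@$k$ and says iteration stops once gains become very limited or performance drops. A minimal Python sketch of such a metric and stopping check is given below. The cut-offs in `ks`, the `min_gain` tolerance, and the helper names `recall_at_k`, `average_recall`, and `should_stop` are assumptions for illustration; the summary does not state which values of $k$ or which stopping tolerance the authors use.

```python
from typing import List, Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Fraction of queries whose target appears among the top-k ranked ids."""
    hits = sum(1 for ids, tgt in zip(ranked, targets) if tgt in ids[:k])
    return hits / len(targets)


def average_recall(ranked: Sequence[Sequence[str]], targets: Sequence[str],
                   ks: Sequence[int] = (1, 2, 3)) -> float:
    """Mean of R@k over the chosen cut-offs (the ks here are toy values)."""
    return sum(recall_at_k(ranked, targets, k) for k in ks) / len(ks)


def should_stop(history: List[float], min_gain: float = 0.1) -> bool:
    """Stop when the latest iteration gains less than min_gain or gets worse."""
    if len(history) < 2:
        return False
    return history[-1] - history[-2] < min_gain


# Two toy queries, each with a ranked list of candidate image ids.
ranked = [["a", "b", "c"], ["x", "y", "z"]]
targets = ["b", "z"]
avg = average_recall(ranked, targets)  # R@1 = 0.0, R@2 = 0.5, R@3 = 1.0
```

With these toy rankings the average is 0.5. A call such as `should_stop([50.0, 51.5, 51.55])` returns `True` because the last gain falls below the assumed tolerance, which mirrors the "very limited gains" case in the caption.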