Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations

Xu Zhang, Zhedong Zheng, Linchao Zhu, Yi Yang

TL;DR

This work addresses triplet ambiguity in composed image retrieval, where noisy annotations yield many false negatives. It introduces Css-Net, a consensus-based architecture with four diverse compositors that learn collaboratively under a KL divergence loss, plus pyramid training to exploit multi-scale image features and text-image compositors to capture complementary cues. Empirical results on Shoes, FashionIQ, and Fashion200k show Css-Net consistently improves recall metrics over strong baselines, with notable gains on FashionIQ such as $+2.77\%$ in R@10 and $+6.67\%$ in R@50, and demonstrates the effectiveness of joint inference and compositor-level collaboration. The approach offers a robust, efficient solution to the intrinsic data ambiguity in composed retrieval, with practical impact for interactive search tasks that rely on reference images and descriptive captions.

Abstract

Composed image retrieval extends content-based image retrieval systems by enabling users to search using reference images and captions that describe their intention. Despite great progress in developing image-text compositors to extract discriminative visual-linguistic features, we identify a hitherto overlooked issue, triplet ambiguity, which impedes robust feature extraction. Triplet ambiguity refers to a type of semantic ambiguity that arises between the reference image, the relative caption, and the target image. It is mainly due to the limited representation of the annotated text, resulting in many noisy triplets where multiple visually dissimilar candidate images can be matched to an identical reference pair (i.e., a reference image + a relative caption). To address this challenge, we propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals. Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings, fostering complementary feature extraction and mitigating dependence on any single, potentially biased compositor; (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions to promote consensual outputs. During evaluation, the decisions of the four compositors are combined through a weighting scheme, enhancing overall agreement. On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and 6.67% boost in R@50, underscoring its competitiveness in addressing the fundamental limitations of existing methods.
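
The two mechanisms named in the abstract, a KL divergence loss that pulls the compositors toward agreement and a weighted combination of their decisions at evaluation time, can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the authors' implementation: the averaging over all ordered compositor pairs, the uniform fusion weights, and the toy scores are all hypothetical, and the function names are invented for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consensus_kl_loss(per_compositor_scores):
    """Average pairwise KL between the compositors' match distributions.

    per_compositor_scores: one list of query-to-candidate similarity scores
    per compositor (same candidates, same order).
    Assumption: averaging over all ordered pairs; the paper's exact pairing
    and weighting may differ.
    """
    dists = [softmax(s) for s in per_compositor_scores]
    n = len(dists)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(kl_divergence(dists[i], dists[j]) for i, j in pairs) / len(pairs)

def fuse_scores(per_compositor_scores, weights):
    """Weighted sum of per-compositor scores, candidate by candidate."""
    n_candidates = len(per_compositor_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, per_compositor_scores))
        for c in range(n_candidates)
    ]

# Four hypothetical compositors scoring the same three candidates.
scores = [
    [2.0, 0.5, 0.1],
    [1.8, 0.7, 0.2],
    [0.4, 1.9, 0.3],  # this compositor disagrees with the other three
    [2.1, 0.4, 0.0],
]
loss = consensus_kl_loss(scores)  # positive while the compositors disagree
fused = fuse_scores(scores, weights=[0.25, 0.25, 0.25, 0.25])  # assumed uniform weights
best = max(range(len(fused)), key=fused.__getitem__)  # index of the top-ranked candidate
```

In this toy case one compositor prefers a different candidate, yet the fused ranking follows the majority, which is the intuition behind not depending on any single, potentially biased compositor. The loss falls to zero only when all compositors produce the same distribution over candidates.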

Paper Structure (17 sections, 10 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Schematic illustration of the composed image retrieval system. Given a reference image and a relative caption, the system aims to retrieve the intended target image from all candidate images.
  • Figure 2: Illustration of the triplet ambiguity problem. Triplet ambiguity produces multiple false-negative samples in the dataset, because an annotator usually sees one triplet with its true match at a time and overlooks other candidates that would also match.
  • Figure 3: Schematic illustration of the Consensus Network. Given a reference image and a relative caption, the image encoder $F_{img}$ extracts the mid-level image feature $\bm{f_r^m}$ and the high-level image feature $\bm{f_r^h}$, and the text encoder $F_{text}$ extracts the text feature $\bm{f_s}$. The compositors then fuse the text feature with either the mid-level or the high-level image feature, and each compositor generates a distinct composed feature. Finally, the composed features are matched with the corresponding target features, and a KL loss is imposed between the image-text compositors during training.
  • Figure 4: Comparison between the batch-based classification and the global-wise classification (GWC) on the Shoes dataset. GWC significantly degrades the performance since more false negative samples are involved due to triplet ambiguity.
  • Figure 5: Top-10 retrieval results on three datasets. The composed queries consist of a reference image and a relative caption that describes the desired modification. The blue/green boxes refer to the reference image and the true match(es).
  • ...and 1 more figure
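
The Figure 4 caption attributes the drop under global-wise classification to the extra false negatives it pulls in. A toy example can make that effect concrete. The sketch below is an illustration under assumptions, not the paper's loss: it uses dot-product similarity and a plain softmax cross-entropy, and the two-dimensional features and the function name are hypothetical.

```python
import math

def dot(a, b):
    """Dot-product similarity between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def classification_loss(composed, candidates):
    """Cross-entropy where query i must pick candidates[i] among all candidates.

    The first len(composed) candidates are the annotated targets, in order;
    any further entries act purely as negatives.
    """
    losses = []
    for i, q in enumerate(composed):
        logits = [dot(q, c) for c in candidates]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)

# Two composed queries and their annotated targets (hypothetical features).
composed = [[1.0, 0.0], [0.0, 1.0]]
batch_targets = [[1.0, 0.0], [0.0, 1.0]]

# An unlabeled gallery image that matches query 0 almost as well as its
# annotated target: a false negative in the sense of triplet ambiguity.
unlabeled_match = [0.9, 0.1]

batch_loss = classification_loss(composed, batch_targets)
global_loss = classification_loss(composed, batch_targets + [unlabeled_match])
```

Here `global_loss` exceeds `batch_loss` even though query 0 already scores its annotated target highest: the unlabeled match is treated as a negative and the model is penalized for ranking it well. Restricting the negatives to the batch makes it less likely that such candidates appear in the denominator.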