Pseudo-triplet Guided Few-shot Composed Image Retrieval

Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Mingzhu Xu, Xuemeng Song

TL;DR

PTG-FSCIR is a novel two-stage, pseudo-triplet guided few-shot CIR scheme that constructs pseudo triplets from pure image data and uses them for CIR-task-specific pretraining.

Abstract

Composed Image Retrieval (CIR) is a challenging task that aims to retrieve a target image given a multimodal query, i.e., a reference image and its complementary modification text. Because previous supervised and zero-shot learning paradigms fail to strike a good trade-off between the model's generalization ability and retrieval performance, researchers have recently introduced the task of few-shot CIR (FS-CIR) and proposed a textual inversion-based network built on the pretrained CLIP model to realize it. Despite its promising performance, the approach has two key limitations: it relies solely on the few annotated samples for CIR model training, and it selects training triplets indiscriminately for CIR model fine-tuning. To address these two limitations, we propose a novel two-stage pseudo triplet guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we propose an attentive masking and captioning-based pseudo triplet generation method, which constructs pseudo triplets from pure image data and uses them for CIR-task-specific pretraining. In the second stage, we propose a challenging triplet-based CIR fine-tuning method, in which we design a pseudo modification text-based sample challenging score estimation strategy and a robust top range-based random sampling strategy for sampling robust challenging triplets to promote model fine-tuning. Notably, our scheme is plug-and-play and compatible with any existing supervised CIR model. We test our scheme across two backbones on three public datasets (i.e., FashionIQ, CIRR, and Birds-to-Words), achieving maximum improvements of 13.3%, 22.2%, and 17.4%, respectively, demonstrating our scheme's efficacy.
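
The second-stage sampling idea can be illustrated with a minimal sketch. The abstract does not say how the challenging score is computed or how the "top range" is bounded, so the code below is only one plausible reading and not the authors' implementation: it assumes each training triplet already has a scalar challenging score, and it draws uniformly at random from the highest-scoring fraction of triplets. The function name `sample_from_top_range` and the parameter `top_fraction` are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_from_top_range(
    triplets: Sequence[T],
    scores: Sequence[float],
    num_samples: int,
    top_fraction: float = 0.3,  # hypothetical knob; not taken from the paper
    seed: Optional[int] = None,
) -> List[T]:
    """Draw `num_samples` distinct triplets uniformly from the top-scoring range."""
    if len(triplets) != len(scores):
        raise ValueError("triplets and scores must have the same length")
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")

    # Rank triplet indices by challenging score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Candidate pool: the top fraction, widened if needed to fit num_samples.
    pool_size = max(num_samples, math.ceil(top_fraction * len(ranked)))
    pool = ranked[:pool_size]
    if num_samples > len(pool):
        raise ValueError("num_samples exceeds the number of available triplets")

    # Random draw inside the pool instead of a deterministic top-k pick
    # (assumption: this is what makes the selection less sensitive to noisy scores).
    rng = random.Random(seed)
    chosen = rng.sample(pool, num_samples)
    return [triplets[i] for i in chosen]
```

With ten triplets and `top_fraction=0.5`, every returned triplet comes from the five highest-scoring ones, but which of those five are returned depends on the seed.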

Paper Structure

This paper contains 16 sections, 2 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Illustration of the CIR task.
  • Figure 2: PTG-FSCIR consists of two stages: pseudo triplet-based CIR pretraining and challenging triplet-based CIR fine-tuning.
  • Figure 3: Illustration of skewed challenging score distributions on three subsets of FashionIQ, CIRR, and B2W. The backbone model is SPRC r21.
  • Figure 4: Sensitivity experiments on the masking rate, with SPRC as the backbone on FashionIQ. The red line represents Average R@10, and the blue line represents Average R@50.
  • Figure 5: Illustration of CIR results by SPRC with or without our scheme on CIRR and FashionIQ, respectively. The ground-truth target images are highlighted with green boxes.
  • ...and 2 more figures