Composed Image Retrieval with Text Feedback via Multi-grained Uncertainty Regularization

Yiyang Chen; Zhedong Zheng; Wei Ji; Leigang Qu; Tat-Seng Chua

TL;DR

This work addresses text-guided image retrieval where users transition from coarse to fine feedback, identifying a misalignment between traditional one-to-one metric learning and multi-grained retrieval. It proposes a unified learning framework with two modules: uncertainty modeling, which introduces Gaussian fluctuation to simulate intra-class jitter, and uncertainty regularization, which adaptively scales the learning objective to handle one-to-many matching during training. The method combines a coarse-grained loss with a dynamically weighted fine-grained loss, yielding improved Recall@K across FashionIQ, Fashion200k, and Shoes, and demonstrates robustness through ablations and comparisons with strong baselines. The approach offers practical benefits for real-world retrieval by preserving potential candidates early in retrieval and integrating multi-grained supervision into a single optimization objective.

Abstract

We investigate composed image retrieval with text feedback. Users gradually look for the target of interest by moving from coarse to fine-grained feedback. However, existing methods focus only on the latter, i.e., fine-grained search, by harnessing positive and negative pairs during training. This pair-based paradigm only considers the one-to-one distance between a pair of specific points, which is not aligned with the one-to-many coarse-grained retrieval process and compromises the recall rate. To fill this gap, we introduce a unified learning approach that simultaneously models coarse- and fine-grained retrieval by considering multi-grained uncertainty. The key idea underpinning the proposed method is to treat fine- and coarse-grained retrieval as matching data points with small and large fluctuations, respectively. Specifically, our method contains two modules: uncertainty modeling and uncertainty regularization. (1) Uncertainty modeling simulates multi-grained queries by introducing identically distributed fluctuations in the feature space. (2) Building on uncertainty modeling, uncertainty regularization adapts the matching objective according to the fluctuation range. Compared with existing methods, the proposed strategy explicitly prevents the model from pushing away potential candidates in the early stage, and thus improves the recall rate. On three public datasets, i.e., FashionIQ, Fashion200k, and Shoes, the proposed method achieves +4.03%, +3.38%, and +2.40% Recall@50 accuracy over a strong baseline, respectively.
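The gains above are reported as Recall@K. As a minimal sketch of how that metric is commonly computed, assuming exactly one ground-truth target per query (the function name `recall_at_k` and the toy data are illustrative, not taken from the paper):

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Fraction of queries whose target appears among the top-k retrieved items.

    ranked_ids: one ranked list of gallery ids per query (best match first).
    target_ids: the single ground-truth target id for each query (assumption).
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need one ranked list per target")
    if not target_ids:
        return 0.0
    hits = sum(
        1
        for ranking, target in zip(ranked_ids, target_ids)
        if target in ranking[:k]
    )
    return hits / len(target_ids)


# Toy example: three queries over a five-image gallery.
rankings = [[3, 1, 4, 0, 2], [2, 0, 1, 3, 4], [4, 3, 2, 1, 0]]
targets = [1, 3, 0]
r1 = recall_at_k(rankings, targets, 1)
r2 = recall_at_k(rankings, targets, 2)
r5 = recall_at_k(rankings, targets, 5)
```

A larger K rewards a model for keeping the target anywhere in the candidate list, which is why a method that avoids pushing away plausible candidates would be expected to help most at Recall@50.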

Paper Structure (14 sections, 6 equations, 3 figures, 4 tables)

Introduction
Related work
Composed Image Retrieval with Text Feedback
Uncertainty Learning
Method
Problem Definition
Uncertainty Modeling
Uncertainty Regularization
Experiment
Implementation Details
Datasets
Comparison with Competitive Methods
Further Analysis and Discussions
Conclusion

Figures (3)

Figure 1: (a) The typical retrieval process contains two steps, i.e., coarse-grained retrieval and fine-grained retrieval. Coarse-grained retrieval uses brief descriptions or imprecise query images, while fine-grained retrieval requires more details for one-to-one mapping. Existing approaches usually focus on optimizing the strict pair-wise distance during training, which differs from the one-to-many coarse-grained test setting. Over-emphasizing one-to-one metric learning compromises the model's ability to recall potential candidates. (b) Our intuition. There are two typical matching types for fine- and coarse-grained retrieval. Here we show the difference between one-to-one matching (left) and one-to-many matching (right).
Figure 2: The overview of our network. Given the source image $I_s$ and the text $T_s$ for modification, we obtain the composed feature $f_s$ by combining $f^T_s$ and $f^I_s$ via the compositor. The compositor contains a content module and a style module. Meanwhile, we extract the visual feature $f_t$ of the target image $I_t$ via the same image encoder used for the source image. Our main contributions are the uncertainty modeling via the augmenter and the uncertainty regularization for coarse matching. (1) The proposed augmenter applies feature-level noise to $f_t$, yielding $\hat{f_t}$, with Gaussian noise drawn from $N(1,\sigma_t)$ and $N(\mu_t,\sigma_t)$. Albeit simple, the augmented feature $\hat{f_t}$ simulates the intra-class jittering of the target image, following the original feature distribution. (2) The commonly used InfoNCE loss focuses on the fine-grained one-to-one mapping between the original target feature $f_t$ and the composed feature $f_s$. In contrast, the proposed method uses the augmented feature $\hat{f_t}$ and $f_s$ to simulate the one-to-many mapping, considering different fluctuations during training. Our model applies both the fine-grained matching and the proposed coarse-grained uncertainty regularization to facilitate model training.
Figure 3: Qualitative image retrieval results on FashionIQ, Fashion200k, and Shoes. We mainly compare the top-5 ranking lists of the proposed method with those of the baseline. (Please zoom in for better visualization.)
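The augmenter and the two losses described in the Figure 2 caption can be sketched in plain Python. This is a rough illustration under several assumptions that the text here does not confirm: the two Gaussian draws are used as an elementwise scale and shift, the second argument of $N(\cdot,\cdot)$ is read as a standard deviation, $\mu_t$ and $\sigma_t$ are taken as the mean and standard deviation of the target features, and the coarse term is weighted by $1/(2\sigma^2)$ plus a $\frac{1}{2}\log\sigma^2$ penalty in the style of common uncertainty-weighted losses. The names `augment`, `info_nce`, `uncertainty_regularized`, `total_loss`, and the `gamma` and `temperature` parameters are hypothetical.

```python
import math
import random
import statistics


def augment(f_t, mu, sigma, rng):
    """Jitter a target feature (assumed form): scale ~ N(1, sigma), shift ~ N(mu, sigma)."""
    return [rng.gauss(1.0, sigma) * x + rng.gauss(mu, sigma) for x in f_t]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(queries, targets, temperature=0.1):
    """Fine-grained one-to-one loss: the i-th target is the positive for the i-th query."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[i]
    return total / len(queries)


def uncertainty_regularized(queries, targets, mu, sigma, rng, eps=1e-6):
    """Coarse loss on jittered targets, down-weighted as the fluctuation grows (assumed form)."""
    jittered = [augment(t, mu, sigma, rng) for t in targets]
    var = sigma * sigma + eps
    return info_nce(queries, jittered) / (2.0 * var) + 0.5 * math.log(var)


def total_loss(queries, targets, mu, sigma, rng, gamma=0.5):
    """Blend the coarse uncertainty term with the fine-grained InfoNCE term."""
    coarse = uncertainty_regularized(queries, targets, mu, sigma, rng)
    fine = info_nce(queries, targets)
    return gamma * coarse + (1.0 - gamma) * fine


# Toy batch: two composed query features f_s and two target features f_t.
rng = random.Random(0)
queries = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2]]
targets = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3]]
flat = [x for t in targets for x in t]
mu, sigma = statistics.mean(flat), statistics.pstdev(flat)
loss = total_loss(queries, targets, mu, sigma, rng)
```

With a small fluctuation the jittered target stays close to $f_t$ and the coarse term behaves like ordinary one-to-one matching; with a large fluctuation its weight shrinks, so the model is penalized less for failing to separate loosely matching candidates.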
