Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

Zheyuan Liu, Weixuan Sun, Damien Teney, Stephen Gould

TL;DR

This paper tackles composed image retrieval (CIR), where the goal is to retrieve an image $I_C$ that best matches a query $q=\langle I_R, t\rangle$ consisting of a reference image and modification text. It proposes a two-stage pipeline. The first stage is a fast candidate filter: a BLIP-based multi-modal encoder produces a query-aware embedding $z_t$, which is compared by cosine similarity against precomputed candidate embeddings. The second stage is a more expressive re-ranker that jointly reasons over the query and a small set of top-$K$ candidates via a dual-encoder architecture with cross-attention and a merging mechanism. The re-ranking model is trained separately with contrastive losses on triplets $\langle I_R, t, I_C\rangle$ and batch negatives, enabling rich query–candidate interactions while keeping inference tractable. On Fashion-IQ and CIRR, the approach yields state-of-the-art results, with the re-ranking stage providing substantial gains over single-stage methods and other BLIP-based baselines. The work demonstrates the practicality of a two-stage CIR framework and contributes architectural insights for cross-modal triplet reasoning, supported by open-source code for reproducibility and further research.
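To make the phrase "contrastive losses on triplets and batch negatives" concrete, the following is a minimal sketch of a generic in-batch contrastive objective, not the paper's exact loss. It assumes that each query embedding is paired with the target embedding at the same batch index and that every other target in the batch serves as a negative. The function names, the use of cosine similarity as the logit, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(query_embs, target_embs, temperature=0.07):
    """Mean softmax cross-entropy over the batch.

    Row i treats target i as the positive and all other targets in the
    batch as negatives. The temperature is an assumed hyperparameter.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [cosine(q, t) / temperature for t in target_embs]
        # Numerically stable log-sum-exp over the batch of targets.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two queries with 2-d embeddings (made-up numbers).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is close to its own target
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is close to the wrong target

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)
```

In this toy batch the loss is small when every query sits closest to its own target and large when the pairing is swapped, which is the behavior a contrastive objective of this kind is meant to reward.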

Abstract

Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task. Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR.
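The two-stage scheme described in the abstract can be sketched as a filter-then-rerank loop. This is an illustration under stated assumptions rather than the released implementation: `filter_candidates`, `rerank`, and the lookup-table scorer are hypothetical names, the embeddings are toy 2-d vectors, and the expensive joint model is stood in for by a fixed table of scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_candidates(query_emb, candidate_embs, k):
    """Stage 1: rank every candidate by cosine similarity to the query
    embedding (candidate embeddings can be precomputed) and keep the
    indices of the top k."""
    order = sorted(
        range(len(candidate_embs)),
        key=lambda i: cosine(query_emb, candidate_embs[i]),
        reverse=True,
    )
    return order[:k]

def rerank(query, candidate_ids, score_fn):
    """Stage 2: rescore only the shortlisted candidates with a more
    expensive scorer that sees the query and the candidate together."""
    return sorted(candidate_ids, key=lambda i: score_fn(query, i), reverse=True)

# Toy corpus of five candidate embeddings and one query embedding.
candidate_embs = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
query_emb = [1.0, 0.2]

shortlist = filter_candidates(query_emb, candidate_embs, k=3)

# Hypothetical stand-in for the joint query-candidate model: a fixed table.
toy_joint_scores = {0: 0.2, 1: 0.9, 2: 0.5}
expensive_calls = []

def toy_score_fn(query, candidate_id):
    expensive_calls.append(candidate_id)  # track how often the costly model runs
    return toy_joint_scores[candidate_id]

reranked = rerank(query_emb, shortlist, toy_score_fn)
```

The point of the design is visible in `expensive_calls`: the costly scorer runs once per shortlisted candidate (here 3 times) instead of once per corpus item (here 5), while the cheap stage only needs vector comparisons.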

Paper Structure (27 sections, 2 equations, 8 figures, 7 tables)

This paper contains 27 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: An illustration of the two-stage scheme, with easy negatives pre-filtered out, and the remaining candidates re-ranked.
  • Figure 2: Overall training pipeline. In both stages, we freeze the image encoders (dashed fillings), as detailed in Section \ref{sec:main2-implementation-details}. (Left) Candidate filtering model, which takes as input the tokenized text and cross-attends it with the reference image. The output is the sequential feature $z_t$, where we extract the [CLS] token as the summarized representation of the query $q=\langle I_\text{R},t\rangle$ to compare its similarity with features of $I_\text{T}'$. (Right) Candidate re-ranking model with dual-encoder architecture. Stacked elements signify that we exhaustively pair up each candidate $I_\text{T}'$ among the selected top-$K$ with the query $q$ for assessment. Note that the two encoders take in different inputs for cross-attention. The output [CLS] tokens are concatenated and passed for producing a logit. Note that the two stages are two separate models and not jointly trained.
  • Figure 3: Details of the transformer layer in our dual-encoder architecture. Here, we take the first layer as an example. SA: Self-attention layer, CA: Cross-attention layer, FF: Feed-forward layer. $\oplus$: element-wise addition for residual connections. All modules in the figure are being trained. Dashed fillings on FF suggest weight-sharing.
  • Figure 4: Qualitative examples on CIRR. For each sample, we showcase the query (left) with the filtered top-6 candidates (row F), followed by the re-ranked top-6 results (row R). True targets are in green frames. We demonstrate three cases where re-ranking brings the true target forward (a-c), and one failure case (d).
  • Figure 5: Qualitative examples on Fashion-IQ. For each sample, we showcase the query (left) with the filtered top-6 candidates (row F), followed by the re-ranked top-6 results (row R). Each query comes with two annotation sentences, which are joined by "and". True targets are in green frames. For examples with the ground truth initially ranked beyond the top-6, we report their rankings below the annotation, as in (a) and (c).
  • ...and 3 more figures
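
The Figure 2 caption states that the two output [CLS] tokens are concatenated and passed on to produce a logit for each query-candidate pairing. A minimal sketch of that final step follows. It assumes a single linear layer as the scoring head, which the caption does not specify, and `concat_logit`, `rank_by_logit`, the weights, and the toy [CLS] vectors are all hypothetical.

```python
def concat_logit(cls_a, cls_b, weights, bias=0.0):
    """Concatenate two [CLS] vectors and map them to one scalar logit
    with a single linear layer (an assumed head, for illustration)."""
    merged = list(cls_a) + list(cls_b)
    if len(merged) != len(weights):
        raise ValueError("weights must match the concatenated length")
    return sum(w * x for w, x in zip(weights, merged)) + bias

def rank_by_logit(cls_pairs, weights, bias=0.0):
    """Score each of the K shortlisted candidates and return their
    positions in descending logit order, along with the logits."""
    logits = [concat_logit(a, b, weights, bias) for a, b in cls_pairs]
    order = sorted(range(len(cls_pairs)), key=lambda i: logits[i], reverse=True)
    return order, logits

# One (encoder-1 [CLS], encoder-2 [CLS]) pair per candidate; made-up values.
cls_pairs = [
    ([0.1, 0.2], [0.3, 0.1]),
    ([0.9, 0.0], [0.0, 0.8]),
    ([0.5, 0.5], [0.5, 0.2]),
]
weights = [1.0, 0.0, 0.0, 1.0]  # hypothetical weights of the linear head

order, logits = rank_by_logit(cls_pairs, weights)
```

Because each logit depends on a specific query-candidate pairing, these scores cannot be precomputed over the corpus, which is why this step is applied only to the top-$K$ candidates.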