DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

Geon Park, Ji-Hoon Park, Seong-Whan Lee

TL;DR

DQE-CIR learns distinctive query embeddings by explicitly modeling target relative relevance during training. This makes retrieval more reliable for fine-grained attribute changes by improving query discriminativeness and reducing confusion caused by semantically similar but irrelevant candidates.

Abstract

Composed image retrieval (CIR) addresses the task of retrieving a target image by jointly interpreting a reference image and a modification text that specifies the intended change. Most existing methods are still built upon contrastive learning frameworks that treat the ground truth image as the only positive instance and all remaining images as negatives. This strategy inevitably introduces relevance suppression, where semantically related and valid images are incorrectly pushed away, and semantic confusion, where different modification intents collapse into overlapping regions of the embedding space. As a result, the learned query representations often lack discriminativeness, particularly for fine-grained attribute modifications. To overcome these limitations, we propose distinctive query embeddings through learnable attribute weights and target relative negative sampling (DQE-CIR), a method designed to learn distinctive query embeddings by explicitly modeling target relative relevance during training. DQE-CIR incorporates learnable attribute weighting to emphasize distinctive visual features conditioned on the modification text, enabling more precise feature alignment between language and vision. Furthermore, we introduce target relative negative sampling, which constructs a target relative similarity distribution and selects informative negatives from a mid-zone region that excludes both easy negatives and ambiguous false negatives. This strategy enables more reliable retrieval for fine-grained attribute changes by improving query discriminativeness and reducing confusion caused by semantically similar but irrelevant candidates.

Paper Structure (28 sections, 8 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Example of attribute-aware relevance separation in CIR. Given a reference image and the modification text specifying green and short-sleeved, candidate images are grouped according to their attribute alignment. The red box shows irrelevant images that violate key attributes and commonly appear in top-ranked results of existing CIR models. In contrast, the blue box shows relevant images that satisfy both attributes, which DQE-CIR retrieves consistently through attribute-aware query embeddings and distinctive ranking constraints.
  • Figure 2: Overview of the proposed DQE-CIR framework. DQE-CIR first encodes the reference image, modification text, and candidate images using BLIP-2 to obtain base query representations. Learnable Attribute Weights then enhance the color- and shape-specific sub-queries, which are combined to form the final attribute-aware query embedding and optimized through KL-divergence and attribute-aware margin losses. In parallel, Target Relative Negative Sampling (TRNS) selects a single negative from the $\Delta$-score–based mid-zone, enabling distinctive pairwise ranking training. Together, these components strengthen fine-grained attribute sensitivity and improve retrieval discriminativeness.
  • Figure 3: Illustration of Target Relative Negative Sampling (TRNS). The total query embedding is compared with the target embedding and all image corpus embeddings using cosine similarity to derive relevance scores. Subsequently, the $\Delta$-score is computed as the difference between the target similarity and each candidate similarity. Candidate images whose $\Delta$-score falls within a predefined mid-zone are designated as informative negatives for training.
  • Figure 4: Qualitative comparison of different CIR models on the FashionIQ validation dataset. Given a reference image and a modification text specifying “a blue short-sleeved shirt with white lettering”, we compare the top-ranked retrieval results produced by different CIR methods. While baseline models often retrieve images that partially match the query attributes, DQE-CIR retrieves images that correctly satisfy all specified modifications.
  • Figure 5: Qualitative comparison of DQE-CIR on the CIRR test dataset across different compositional categories. Retrieval results are conditioned on four categories of modifications: (a) Color, (b) Quantity, (c) Appearance, and (d) Form. Given a reference image and a modification text, the model retrieves images that apply the specified attribute changes while preserving the object.
  • ...and 2 more figures
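
The TRNS procedure described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. The function name `trns_select_negative`, the mid-zone bounds `low` and `high`, and the uniform random choice of a single negative inside the mid-zone are all assumptions made for the example. The source only states that the $\Delta$-score is the target similarity minus each candidate similarity, and that a single negative is taken from a predefined mid-zone.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def trns_select_negative(query, target, corpus, target_index,
                         low=0.05, high=0.30, rng=None):
    """Pick one mid-zone negative for a single query.

    query:        (D,) composed query embedding
    target:       (D,) embedding of the ground-truth target image
    corpus:       (N, D) embeddings of all candidate images
    target_index: row of `corpus` that holds the target (never sampled)
    low, high:    hypothetical mid-zone bounds on the delta-score (assumed values)

    Returns (index or None, delta) where delta has shape (N,).
    """
    rng = np.random.default_rng() if rng is None else rng

    q = l2_normalize(np.asarray(query, dtype=float))
    t = l2_normalize(np.asarray(target, dtype=float))
    c = l2_normalize(np.asarray(corpus, dtype=float))

    target_sim = float(q @ t)   # cosine similarity between query and target
    cand_sim = c @ q            # cosine similarity between query and each candidate

    # Delta-score: target similarity minus candidate similarity.
    # Small delta  -> candidate scores almost like the target (possible false negative).
    # Large delta  -> candidate is far from the query (easy negative).
    delta = target_sim - cand_sim

    mid_zone = (delta >= low) & (delta <= high)
    mid_zone[target_index] = False   # the target itself is not a negative

    candidates = np.flatnonzero(mid_zone)
    if candidates.size == 0:
        return None, delta           # no informative negative for this query

    # Assumption: choose uniformly among mid-zone candidates.
    return int(rng.choice(candidates)), delta


# Toy 2-D example: only candidate 2 falls inside the assumed mid-zone.
query = np.array([1.0, 0.0])
target = np.array([1.0, 0.1])
corpus = np.array([
    [1.0, 0.1],    # 0: the target itself
    [1.0, 0.05],   # 1: scores slightly above the target (ambiguous, excluded)
    [1.0, 0.6],    # 2: moderately similar (mid-zone)
    [0.0, 1.0],    # 3: orthogonal (easy negative, excluded)
    [-1.0, 0.0],   # 4: opposite direction (easy negative, excluded)
])
neg_index, delta = trns_select_negative(query, target, corpus, target_index=0)
print(neg_index, np.round(delta, 3))
```

In this toy setup the lower bound filters out candidates that score about as high as the target, and the upper bound filters out candidates that are trivially far away. The selected index could then be used as the negative in a pairwise ranking loss. The bounds would need to be tuned for real embeddings.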