DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval

Yuxin Yang, Yinan Zhou, Yuxin Chen, Ziqi Zhang, Zongyang Ma, Chunfeng Yuan, Bing Li, Lin Song, Jun Gao, Peng Li, Weiming Hu

TL;DR

This work introduces DetailFusion, a dual-branch framework for Composed Image Retrieval that explicitly models both global semantics and fine-grained visual details. A Detail-oriented Inference (DI) branch and a Global Feature Matching (GM) branch are adaptively fused by an Adaptive Feature Compositor, trained through a three-stage strategy that leverages image-editing data (IPr2Pr) to sharpen detail perception. Results on CIRR and FashionIQ demonstrate state-of-the-art performance, with ablations confirming the necessity of pretraining, dual-branch coordination, and the compositional fusion design. The approach offers cross-domain robustness and practical gains for CIR tasks requiring precise interpretation of textual modifications and subtle visual changes.

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the query into a unified feature for retrieval. However, due to insufficient attention to fine-grained details, these coarse fusion methods often struggle with handling subtle visual alterations or intricate textual instructions. In this work, we propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities, thereby enabling detail-enhanced CIR. Our approach leverages atomic detail variation priors derived from an image editing dataset, supplemented by a detail-oriented optimization strategy to develop a Detail-oriented Inference Branch. Furthermore, we design an Adaptive Feature Compositor that dynamically fuses global and detailed features based on fine-grained information of each unique multimodal query. Extensive experiments and ablation analyses not only demonstrate that our method achieves state-of-the-art performance on both CIRR and FashionIQ datasets but also validate the effectiveness and cross-domain adaptability of detail enhancement for CIR.
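
The retrieval setup described above (a query encoded into one unified feature, then matched against a gallery) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of cosine similarity as the matching score and the function names `l2_normalize`, `cosine_similarity`, and `rank_gallery` are assumptions made for illustration, and the feature vectors are taken as already produced by some encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_gallery(query_feature, gallery_features):
    """Return gallery indices sorted from most to least similar to the query.

    `query_feature` stands in for the unified feature of the
    (reference image, modification text) pair; how it is computed
    is outside the scope of this sketch.
    """
    scores = [cosine_similarity(query_feature, g) for g in gallery_features]
    return sorted(range(len(gallery_features)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    query = [1.0, 0.0]
    gallery = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    print(rank_gallery(query, gallery))
```

In this toy gallery the second image points in nearly the same direction as the query, so it ranks first; the opposite-facing third image ranks last.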

Paper Structure

This paper contains 25 sections, 11 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Workflows of existing supervised CIR methods and ours: (a) Late fusion, (b) Textual inversion, and (c) Our proposed DetailFusion. Lacking dedicated modules for fine-grained details, the first two prevalent CIR methods struggle to perceive fine-grained details in the reference image while processing complex requirements in the modification text. (a) Fails to implement the textual transformation 'Remove' and misses the requirement 'tongue out'. (b) Overlooks implicit visual detail information, failing to preserve the dog's breed. (c) Our approach combines global semantic understanding with fine-grained detail interaction, resulting in accurate retrieval outcomes.
  • Figure 2: Overall pipeline of our proposed DetailFusion. The upper section illustrates the training phase, while the lower-left displays the datasets utilized at each training stage, and the lower-right illustrates the inference phase. During the training phase, the DI branch is first pre-trained on the IPr2Pr dataset, followed by joint fine-tuning of both the DI and GM branches on CIR datasets. Finally, the parameters of both branches are frozen, and the Compositor is trained from scratch on the corresponding CIR datasets. During the inference phase, the multimodal query is encoded through the DI and GM branches to extract fine-grained detail and global semantic features, respectively. These features are subsequently fused and enhanced by the Compositor to produce the final representation.
  • Figure 3: Module architecture of our proposed DetailFusion. (a) Illustration of the shared structure between the DI and GM branches. A hybrid-modal encoder separately encodes the query and image, corresponding to the left and right branches, respectively. The image encoder shares parameters with the DI branch but does not share them with the GM branch. (b) Illustration of the Adaptive Feature Compositor. Features from both branches first interact with fine-grained tokens from both the opposite and the same branch through cross-attention layers within the Fine-grained Feature Extraction Block, followed by a convex combination in the Global-Detail Feature Fusion Block.
  • Figure 4: Analysis of different values of the trade-off hyper-parameter $\gamma$ during joint fine-tuning.
  • Figure 5: Qualitative comparison of our method and the Baseline on the CIRR validation set. Images are arranged in descending order from left to right based on similarity to the multimodal query. Green boxes highlight the target image, while all non-target images are marked with red boxes. In the subset retrieval results, we report the relative similarity score, which is defined as the normalized similarity between each retrieved image and the multimodal query, relative to all images in the gallery.
  • ...and 1 more figures
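
The Figure 3 caption states that the Global-Detail Feature Fusion Block ends with a convex combination of the two branch features. The sketch below shows only what a convex combination of a global and a detail feature looks like. It is not the authors' Compositor: the captions do not say how the mixing weight is obtained, so deriving it from a single scalar `gate_logit` through a sigmoid is an assumption, as are the names `stable_sigmoid` and `convex_fuse`. The cross-attention step that precedes the fusion is omitted.

```python
import math


def stable_sigmoid(x):
    """Map a real-valued logit to (0, 1) without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def convex_fuse(global_feature, detail_feature, gate_logit):
    """Blend two feature vectors with weights that are non-negative and sum to 1.

    `gate_logit` is a hypothetical scalar; in a learned model it would be
    predicted from the query, which this sketch does not attempt.
    """
    if len(global_feature) != len(detail_feature):
        raise ValueError("features must have the same dimension")
    weight = stable_sigmoid(gate_logit)  # share given to the detail feature
    return [
        weight * d + (1.0 - weight) * g
        for g, d in zip(global_feature, detail_feature)
    ]


if __name__ == "__main__":
    # A logit of 0 gives equal weight to both branches.
    print(convex_fuse([1.0, 0.0], [0.0, 1.0], 0.0))
```

Because the two weights are non-negative and sum to one, the fused vector always lies on the segment between the two inputs, so neither branch can be amplified beyond its original scale.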