FBCIR: Balancing Cross-Modal Focuses in Composed Image Retrieval

Chenchen Zhao, Jianhuan Zhuo, Muxi Chen, Zhaohua Zhang, Wenyu Jiang, Tianwen Jiang, Qiuyong Xiao, Jihong Zhang, Qiang Xu

Abstract

Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracy often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the visual and textual input components most crucial to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on these analyses, we further propose a CIR data augmentation workflow that augments existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases while preserving the models' capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.

Paper Structure

This paper contains 28 sections, 4 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Models trained on common-case data learn "shortcuts" that yield correct retrieval results in common cases, but tend to fail on hard cases that require balanced cross-modal focuses. In this paper, we address this issue by constructing targeted hard negatives. Original positives that are not highly consistent with the query are also treated as negatives in the proposed framework and are replaced by more consistent synthetic images.
  • Figure 2: The overall framework of FBCIR, including a multi-modal model focus interpretation method and a dataset augmentation workflow. Given CIR triplets, the focus interpretation method highlights specific image segments and instruction keywords as the model's focuses, and reveals possible focus imbalances. The dataset augmentation workflow augments existing CIR triplets with crafted hard negatives that encourage more balanced focuses. The focus interpretation module serves as the problem indicator and post-hoc validator of the data augmentation module.
  • Figure 3: The multi-modal iterative focus refinement process of FBCIR.
  • Figure 4: Examples of the data constructed by the FBCIR-Data workflow. For the real-life dataset MegaPairs, we synthesize a positive for each triplet and regard the original positive as a special candidate.
  • Figure 5: Visualizations of image and text focus balance ratios, and examples with semantics of a specific modality completely ignored by the models.