CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval

Xintong Jiang, Yaxiong Wang, Mengjian Li, Yujiao Wu, Bingwen Hu, Xueming Qian

TL;DR

CaLa addresses composed image retrieval by revealing two complementary associations within CIR triplets and integrating them with the explicit query–target relation. It introduces TBIA, a hinge-based cross-attention mechanism that aligns reference and target images under the guidance of complementary text, and CTR, a twin-attention visual compositor that reasons about text from fused images. The model optimizes a triple-loss objective $L = L_{QTM} + \alpha L_{TBIA} + \beta L_{CTR}$ and delivers state-of-the-art results on CIRR and FashionIQ across multiple backbones, with notable improvements in both low-rank and fine-grained settings. By exploiting implicit cross-modal relations during training only, CaLa achieves significant performance gains without increasing inference cost, offering a practical path to more precise CIR in real-world search and recommendation systems.
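The triple-loss objective above can be illustrated with a minimal sketch. The weighted sum follows the stated formula; everything else is an assumption made for illustration: the function names `info_nce` and `cala_total_loss` are hypothetical, and the batch contrastive loss is only a stand-in for the individual terms, whose exact form is not given in this summary.

```python
import math


def info_nce(sim_matrix, temperature):
    """Batch contrastive loss (an assumed stand-in for each loss term).

    sim_matrix[i][j] is the similarity between query i and candidate j;
    the matching candidate for query i is assumed to sit at column i.
    """
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)
        # Numerically stable log-sum-exp over all candidates in the batch.
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(sim_matrix)


def cala_total_loss(l_qtm, l_tbia, l_ctr, alpha, beta):
    """Weighted sum L = L_QTM + alpha * L_TBIA + beta * L_CTR."""
    return l_qtm + alpha * l_tbia + beta * l_ctr


# Toy usage with made-up similarity values (not results from the paper).
sims = [[0.9, 0.1], [0.2, 0.8]]
l_qtm = info_nce(sims, temperature=0.1)
loss = cala_total_loss(l_qtm, l_tbia=0.5, l_ctr=0.25, alpha=1.0, beta=1.0)
```

Because the two auxiliary terms only enter the training loss, dropping them at inference leaves the query-target matching path unchanged, which is consistent with the claim that CaLa adds no inference cost.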

Abstract

Composed Image Retrieval (CIR) involves searching for target images based on an image-text pair query. While current methods treat this as a query-target matching problem, we argue that CIR triplets contain additional associations beyond this primary relation. In our paper, we identify two new relations within triplets, treating each triplet as a graph node. Firstly, we introduce the concept of text-bridged image alignment, where the query text serves as a bridge between the query image and the target image. We propose a hinge-based cross-attention mechanism to incorporate this relation into network learning. Secondly, we explore complementary text reasoning, considering CIR as a form of cross-modal retrieval where two images compose to reason about complementary text. To integrate these perspectives effectively, we design a twin attention-based compositor. By combining these complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for CIR. Our framework, CaLa (Complementary Association Learning for Augmenting Composed Image Retrieval), leverages these insights. We evaluate CaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating its superiority in composed image retrieval.

Paper Structure

This paper contains 14 sections, 12 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Illustration of the popular explicit association (a) and the two complementary associations disclosed in this paper (b)-(c). In each panel, the left part shows the association and the right part shows the architecture that integrates the constraint. The explicit association in (a) is typically modeled as a query-target matching problem. By considering the triplet as a graph node, we disclose two new associations, text-bridged image alignment (b) and complementary text reasoning (c), and integrate them into network learning via a hinge-based cross-attention and a twin attention-based compositor.
  • Figure 2: Illustration of our CaLa architecture. CaLa is a two-branch architecture, where a multimodal branch and an image branch extract the query and target image features, respectively. Given a query pair and the matched target image, their features are first extracted with the respective encoders. On top of these base encoders, the proposed hinge-based cross-attention (HCA) module and twin attention-based vision compositor (TAC) module impose the two complementary associations. Note that the data flows for complementary association integration are applied only in the training stage (dashed boxes), introducing no inference burden.
  • Figure 3: The illustration of our hinge-based cross attention. The output of this module can be viewed as a query result from the reference image to the target image, which can be used in the alignment to the reference image.
  • Figure 4: Illustration of our Twin Attention-based Vision Compositor. The two branches of cross-attention layers do not share weights, and the mean of the two CLS tokens is treated as the counterpart of the textual features, which is used to reason about the complementary text.
  • Figure 5: Qualitative results on the CIRR validation set. For a clear comparison, we show results for each baseline alone and with CaLa: BLIP2Cir vs. $\text{CaLa}_\text{BLIP2Cir}$, and ARTEMIS vs. $\text{CaLa}_\text{ARTEMIS}$. Images in red boxes are the target images corresponding to the query pair. The target image is identified more accurately when CaLa is equipped.
  • ...and 2 more figures
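
The twin attention-based compositor described in the Figure 4 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses single-head attention without residual connections or normalization, assumes that one branch lets reference tokens attend to target tokens while the other does the reverse, and assumes the CLS token sits at index 0. The names `cross_attention` and `twin_attention_compose` are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention: each query token attends over context tokens."""
    keys = [matvec(Wk, c) for c in context]
    values = [matvec(Wv, c) for c in context]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for token in queries:
        q = matvec(Wq, token)
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def twin_attention_compose(ref_tokens, tgt_tokens, params_a, params_b):
    """Fuse two image token sequences with two cross-attention branches.

    params_a and params_b are separate (Wq, Wk, Wv) tuples, so the branches
    do not share weights. The mean of the two CLS outputs is returned as the
    fused feature that would be matched against the text feature.
    """
    out_a = cross_attention(ref_tokens, tgt_tokens, *params_a)  # assumed direction
    out_b = cross_attention(tgt_tokens, ref_tokens, *params_b)  # assumed direction
    cls_a, cls_b = out_a[0], out_b[0]  # assumption: CLS token at index 0
    return [(x + y) / 2 for x, y in zip(cls_a, cls_b)]


# Toy usage with identity projections and constant token sequences.
identity = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0] for _ in range(3)]
tgt = [[0.0, 1.0] for _ in range(3)]
fused = twin_attention_compose(ref, tgt, (identity,) * 3, (identity,) * 3)
```

In a trained model the fused vector would be compared with the text embedding, for example through a contrastive loss, so that the two images together are pushed to predict the modification text.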