Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

TL;DR

This work introduces Composed Object Retrieval (COR), a task that localizes target objects within an image using a composed expression formed from a reference object and a retrieval text. It presents COR127K, a large-scale dataset of 127,166 triplets across 408 categories built with a fully automated pipeline, and CORE, an end-to-end model that combines Reference Region Embedding (RRE), Adaptive Vision-Text Interaction (AVTI), and a region-focused contrastive loss to localize and segment the target objects. CORE achieves state-of-the-art performance on COR127K for both base and novel categories, significantly outperforming existing CIR baselines and generalizing well to unseen categories. The authors state that the dataset and model will be publicly released, providing a benchmark and a strong baseline for fine-grained, object-level multi-modal retrieval and grounding.

Abstract

Retrieving fine-grained visual content based on user intent remains a challenge in multi-modal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new task that goes beyond image-level retrieval to achieve object-level precision, allowing the retrieval and segmentation of target objects based on composed expressions that combine reference objects and retrieval texts. COR poses significant challenges in retrieval flexibility: systems must identify arbitrary objects that satisfy the composed expression while rejecting semantically similar but irrelevant negative objects within the same scene. We construct COR127K, the first large-scale COR benchmark, which contains 127,166 retrieval triplets with various semantic transformations across 408 categories. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive visual-textual interaction, and region-level contrastive learning. Extensive experiments demonstrate that CORE significantly outperforms existing models on both base and novel categories, establishing a simple and effective baseline for this challenging task while opening new directions for fine-grained multi-modal retrieval research. We will publicly release both the dataset and the model at https://github.com/wangtong627/COR.
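
To make the idea of region-level contrastive learning concrete, the following is a minimal sketch of a generic InfoNCE-style loss that pulls a composed query embedding toward the target-object embedding and pushes it away from negative-object embeddings in the same scene. It is not the paper's loss, whose exact form is not given in this summary. The function names `cosine` and `info_nce`, the temperature of 0.07, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Generic InfoNCE loss: -log softmax score of the positive region.

    `query` stands in for a composed (reference object + text) embedding,
    `positive` for the target-object embedding, and `negatives` for
    embeddings of other candidate objects in the same image.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy example: the query is aligned with one region and orthogonal to the other.
query = [1.0, 0.0]
loss_aligned = info_nce(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
loss_misaligned = info_nce(query, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
```

In this toy setup the loss is close to zero when the query matches the positive region and large when it matches a negative instead, which is the behavior a region-level objective relies on to separate targets from similar-looking distractors.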

Paper Structure

This paper contains 35 sections, 15 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of the COR task, which retrieves arbitrary target objects from a target image containing candidate objects using composed expressions. It enables fine-grained object-level retrieval, distinguishing targets (i.e., light-colored doughnuts) from negatives (i.e., dark-colored ones). The retrieval text (i.e., “change the color to light”) specifies attribute changes, allowing flexible retrieval based on the reference object and text, without requiring explicit target object names (i.e., doughnut), thus supporting effective retrieval even when object categories are difficult to describe.
  • Figure 2: Statistics of COR127K. (a) Image-resolution distribution, where colors represent resolution-ratio intervals, highlighting scale diversity. (b) Object-to-image area ratio, with colors indicating subsets for direct comparison of area-ratio distributions. (c) Retrieval text word cloud and (d) category word cloud, where word sizes reflect frequency, illustrating common text expressions and dominant categories.
  • Figure 3: Architecture of our proposed CORE model, which comprises: the Reference Region Embedding (RRE) module, the Adaptive Vision-Text Interaction (AVTI) module, and a COR-oriented contrastive loss $\mathcal{L}_{cor}$.
  • Figure 4: Qualitative results. From left to right: (1) Reference Image $I_{ref}$; (2) Reference Object $O_{ref}$; (3) Target Image $I_{tar}$; (4) Target Object $O_{tar}$; (5) Ours; (6) CLIP4CIR; (7) BLIP4CIR; (8) BLIP24CIR; (9) CLIP4CIR-SPN; (10) BLIP24CIR-SPN; (11) FineCIR.
  • Figure S1: The pipeline that combines a detection model, a CIR model, and a segmentation model to address the COR task.
  • ...and 4 more figures
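
The object-to-image area ratio reported in Figure 2(b) can be computed from a binary segmentation mask. The sketch below shows one straightforward way to do so, assuming the mask is a rectangular nested list in which truthy values mark object pixels. The function name `object_area_ratio` and the mask representation are hypothetical and are not taken from the COR127K tooling.

```python
from typing import Sequence


def object_area_ratio(mask: Sequence[Sequence[int]]) -> float:
    """Fraction of image pixels covered by the object in a binary mask."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    if height == 0 or width == 0:
        raise ValueError("mask must contain at least one pixel")
    if any(len(row) != width for row in mask):
        raise ValueError("mask rows must all have the same length")
    # Count pixels marked as belonging to the object.
    object_pixels = sum(1 for row in mask for value in row if value)
    return object_pixels / (height * width)


# Toy 2x4 mask with two object pixels out of eight.
toy_mask = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
ratio = object_area_ratio(toy_mask)
```

A small ratio indicates a small object relative to the image, which is typically the harder case for localization and segmentation.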