
IntRec: Intent-based Retrieval with Contrastive Refinement

Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu

TL;DR

IntRec addresses ambiguity in open-world object retrieval by introducing an Intent State $IS_t$ that stores positive anchors $Z_{pos}^{(t)}$ and negative constraints $Z_{neg}^{(t)}$, together with a contrastive score $S(r_j \mid IS_t) = \max_{z^+ \in Z_{pos}^{(t)}} \cos(r_j, z^+) - \lambda \max_{z^- \in Z_{neg}^{(t)}} \cos(r_j, z^-)$ over candidate regions $r_j$. The state evolves through user feedback: each turn adds regions to $Z_{pos}$ or $Z_{neg}$, which refines the ranking in subsequent turns. The authors provide a theoretical guarantee that the contrastive mechanism resolves ambiguity under a suitable $\lambda$, and report strong empirical gains on LVIS, LVIS-Ambiguous, and zero-shot transfer benchmarks with less than 30 ms of added latency per interaction. Overall, IntRec offers a practical, interactive alternative to one-shot open-vocabulary detectors, enabling precise localization in cluttered scenes with few feedback iterations.
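The contrastive score can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `l2_normalize` and `contrastive_scores`, the default `lam=0.5`, and the choice to drop the penalty term when the negative set is empty are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_scores(regions, z_pos, z_neg=(), lam=0.5):
    """Score every region r_j as max cos(r_j, z+) - lam * max cos(r_j, z-).

    regions: (N, d) candidate region embeddings
    z_pos:   (P, d) positive anchors, P >= 1
    z_neg:   (Q, d) negative constraints; may be empty (assumed: no penalty)
    lam:     weight of the negative term (hypothetical default)
    """
    r = l2_normalize(regions)
    pos_term = (r @ l2_normalize(z_pos).T).max(axis=1)
    if len(z_neg) == 0:
        return pos_term
    neg_term = (r @ l2_normalize(z_neg).T).max(axis=1)
    return pos_term - lam * neg_term

# Toy 2-D embeddings, chosen only to make the arithmetic easy to follow.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(contrastive_scores(regions, z_pos=[[1.0, 0.0]]))
print(contrastive_scores(regions, z_pos=[[1.0, 0.0]], z_neg=[[1.0, 1.0]]))
```

Taking the maximum over each memory set means a region only needs to match one positive anchor to score well, and only needs to resemble one rejected hypothesis to be penalized.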

Abstract

Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overview of the proposed IntRec. It identifies a user-specified object through an interactive loop. An initial query is encoded to initialize the intent state, which contains both positive exemplars $(Z_{pos}^{(t)})$ and negative exemplars $(Z_{neg}^{(t)})$. This state guides the contrastive scoring module, which ranks all candidate regions in the target image based on their similarity to the positive exemplars and dissimilarity to the negative ones. The feedback updates the intent state, enabling the model to accurately localize the target object $(b^\star)$.
  • Figure 2: Qualitative examples of our proposed model detecting rare categories in the LVIS validation set using textual prompts.
  • Figure 3: Localization comparison. While baseline models produce diffuse or imprecise heatmaps, our model generates sharp, accurate localizations that correctly ground all semantic components of the prompt. For instance, it correctly distinguishes the green apples from other fruit and localizes the zebra despite the presence of nearby giraffes.
  • Figure 4: Detection comparison in cluttered scenes. Blue boxes indicate detections of novel/rare categories; red boxes indicate detections of known (base) categories. While the baseline models frequently produce redundant and overlapping detections for novel objects, our model generates accurate predictions and suppresses duplicate detections in dense environments.
  • Figure 5: Analysis of Local vs. Global alignment on the LVIS-Ambiguous benchmark over two interactive turns. Both models start with a low AP at Turn 0 (initial prompt). However, after the first corrective feedback (Turn 1), our model's AP increases by +7.9, outpacing the +4.3 gain of the global baseline.
  • ...and 2 more figures
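
The interactive loop described in the Figure 1 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: `IntentState`, `retrieve`, the `is_target` callback that stands in for the user, and the values of `lam` and `max_turns` are all hypothetical, and the feedback is simplified to rejecting the top-ranked region.

```python
from dataclasses import dataclass, field
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

@dataclass
class IntentState:
    """Dual memory: positive anchors and negative constraints (unit vectors)."""
    z_pos: list = field(default_factory=list)
    z_neg: list = field(default_factory=list)

    def score(self, region, lam=0.5):
        r = unit(region)
        pos = max(float(r @ z) for z in self.z_pos)
        # Assumption: an empty negative set contributes no penalty.
        neg = max((float(r @ z) for z in self.z_neg), default=0.0)
        return pos - lam * neg

    def update(self, region, accepted):
        (self.z_pos if accepted else self.z_neg).append(unit(region))

def retrieve(query_emb, regions, is_target, max_turns=3, lam=0.5):
    """Rank regions, ask the simulated user, and store rejections as negatives."""
    state = IntentState(z_pos=[unit(query_emb)])
    for turn in range(max_turns + 1):
        top = max(range(len(regions)), key=lambda j: state.score(regions[j], lam))
        if is_target(top):
            return top, turn
        state.update(regions[top], accepted=False)
    return top, max_turns

# Region 0 is a distractor that matches the query slightly better than the target.
regions = [[0.9, 0.436], [0.8, -0.6]]
print(retrieve([1.0, 0.0], regions, is_target=lambda j: j == 1))
```

In this toy case the distractor wins at turn 0, is rejected and stored as a negative constraint, and the target is ranked first at turn 1.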