IntRec: Intent-based Retrieval with Contrastive Refinement
Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
TL;DR
IntRec addresses ambiguity in open-world object retrieval by introducing an Intent State $IS_t$ that stores positive anchors $Z_{\mathrm{pos}}^{(t)}$ and negative constraints $Z_{\mathrm{neg}}^{(t)}$, together with a contrastive score $S(r_j \mid IS_t) = \max_{z^+ \in Z_{\mathrm{pos}}^{(t)}} \cos(r_j, z^+) - \lambda \max_{z^- \in Z_{\mathrm{neg}}^{(t)}} \cos(r_j, z^-)$. The state evolves through user feedback: confirmed regions are added to $Z_{\mathrm{pos}}$ and rejected regions to $Z_{\mathrm{neg}}$, which refines the ranking in subsequent turns. The authors provide a theoretical guarantee that the contrastive mechanism resolves ambiguity for a suitable choice of $\lambda$, and demonstrate strong empirical gains on LVIS, LVIS-Ambiguous, and zero-shot transfer benchmarks with minimal latency per interaction. Overall, IntRec offers a practical, interactive alternative to one-shot open-vocabulary detectors, enabling precise localization in cluttered scenes with few feedback iterations.
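The scoring rule and state update described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `IntentState`, the helper `unit`, seeding the positive set with the query embedding, and dropping the penalty term while no negatives exist are all assumptions made for illustration.

```python
import numpy as np


def unit(x):
    """L2-normalize vectors along the last axis (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class IntentState:
    """Toy Intent State holding positive anchors and negative constraints."""

    def __init__(self, query_embedding, lam=0.5):
        # Assumption: the first positive anchor is the query embedding.
        self.pos = [unit(query_embedding)]
        self.neg = []
        self.lam = lam  # weight of the negative penalty (lambda)

    def update(self, region_embedding, accepted):
        # Confirmed regions become positive anchors, rejected ones negatives.
        target = self.pos if accepted else self.neg
        target.append(unit(region_embedding))

    def score(self, regions):
        r = unit(regions)  # shape (N, d)
        # Max cosine similarity to any positive anchor.
        pos_sim = (r @ np.stack(self.pos).T).max(axis=1)
        if not self.neg:
            # Assumption: with no rejections yet, the penalty term is zero.
            return pos_sim
        # Max cosine similarity to any negative constraint.
        neg_sim = (r @ np.stack(self.neg).T).max(axis=1)
        return pos_sim - self.lam * neg_sim


# Two candidate regions that are equally similar to the query.
query = np.array([1.0, 0.0])
regions = np.array([[1.0, 1.0], [1.0, -1.0]])

state = IntentState(query, lam=0.5)
print(state.score(regions))  # scores before any feedback

state.update(regions[0], accepted=False)  # user rejects the first region
print(state.score(regions))  # scores after one corrective turn
```

In this toy setup the two candidates tie before feedback. Rejecting the first one lowers its score by $\lambda$ times its self-similarity, while the second candidate, which is orthogonal to the rejected region, keeps its score and moves to the top of the ranking.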
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner and cannot refine their predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that incorporates such feedback across turns. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing similarity to rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 AP, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single round of corrective feedback, with less than 30 ms of added latency per interaction.
