Robust Relevance Feedback for Interactive Known-Item Video Search

Zhixin Ma, Chong-Wah Ngo

TL;DR

This paper tackles known-item video search (KIS), where user feedback is often misaligned with machine similarity, by combining pairwise relative judgments with a decomposed set of sub-perceptions. A predictive model estimates which sub-perceptions a user relies on for each feedback instance. This enables a soft Bayesian update that weights each sub-perception's contribution by a confidence score and a distance-aware embedding, followed by optional search-space pruning to reduce drift. The approach yields substantial robustness gains on large-scale datasets (V3C1/V3C2) and social-media-style textual KIS (VBS t-KIS), with Recall@1 improvements and faster convergence across varying initial target depths. With pruning, top-1 success exceeds 60% for targets initially ranked between 10 and 50 and 40% for targets initially ranked between 1,000 and 5,000. Overall, the method demonstrates that explicitly modeling user subjectivity and filtering misaligned sub-perceptions can meaningfully improve interactive KIS performance in real-world, large-scale video retrieval scenarios.

Abstract

Known-item search (KIS) involves only a single search target, which makes relevance feedback (typically a powerful technique for efficiently identifying multiple positive examples to infer user intent) inapplicable. PicHunter addresses this issue by asking users to select the top-k examples most similar to the unique search target from a displayed set. Under ideal conditions, when the user's perception aligns closely with the machine's perception of similarity, consistent and precise judgments can elevate the target to the top position within a few iterations. However, in practical scenarios, expecting users to provide consistent judgments is often unrealistic, especially when the underlying embedding features used for similarity measurements lack interpretability. To enhance robustness, we first introduce a pairwise relative judgment feedback that improves the stability of top-k selections by mitigating the impact of misaligned feedback. Then, we decompose user perception into multiple sub-perceptions, each represented as an independent embedding space. This approach assumes that users may not consistently align with a single representation but are more likely to align with one or several among multiple representations. We develop a predictive user model that estimates the combination of sub-perceptions based on each user feedback instance. The predictive user model is then trained to filter out the misaligned sub-perceptions. Experimental evaluations on the large-scale open-domain dataset V3C indicate that the proposed model can bring over 60% of search targets to the top rank when their initial ranks lie between 10 and 50. Even for targets initially ranked between 1,000 and 5,000, the model achieves a success rate exceeding 40% in bringing them to the top rank, demonstrating the enhanced robustness of relevance feedback in KIS despite inconsistent feedback.
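
The feedback loop described in the abstract can be illustrated with a toy sketch. This summary does not give the paper's actual likelihood or predictive model, so everything below is an assumption made for illustration: the sigmoid pairwise likelihood, the `sigma` temperature, the function names, and the fixed `weights` that stand in for the predictive model's confidence scores. It shows only the general idea of a Bayesian update over candidates in which each sub-perception (embedding space) contributes in proportion to its confidence.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pairwise_likelihood(cand, closer, farther, sigma=1.0):
    """Assumed user model for one embedding space: probability that a user
    whose target is `cand` judges `closer` to be more similar than `farther`."""
    margin = sq_dist(cand, farther) - sq_dist(cand, closer)
    return sigmoid(margin / sigma)


def bayes_update(prior, spaces, weights, closer, farther, sigma=1.0):
    """One soft Bayesian update over all candidates.

    prior   -- current probability that each item is the search target
    spaces  -- one embedding table per sub-perception; spaces[m][i] is item i
    weights -- hypothetical confidence per sub-perception (fixed here; the
               paper predicts them per feedback instance)
    closer, farther -- item indices from the user's pairwise judgment
    """
    total_w = sum(weights)
    unnormalized = []
    for i, p in enumerate(prior):
        # Mixture of per-space likelihoods, weighted by confidence.
        lik = sum(
            (w / total_w) * pairwise_likelihood(emb[i], emb[closer], emb[farther], sigma)
            for emb, w in zip(spaces, weights)
        )
        unnormalized.append(p * lik)
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Toy data: item 3 is the target. In space_a it sits next to item 0, so the
# judgment "item 0 is closer than item 1" is aligned with that space only.
space_a = [[0.0, 0.0], [5.0, 5.0], [5.0, 4.0], [0.5, 0.0]]
space_b = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
prior = [0.25, 0.25, 0.25, 0.25]

posterior = bayes_update(prior, [space_a, space_b], [0.9, 0.1], closer=0, farther=1)
print([round(p, 3) for p in posterior])
```

In this toy setting, placing most of the confidence on the aligned space raises the target's posterior, while placing it on the misaligned space lowers it. This is the intuition behind down-weighting misaligned sub-perceptions.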

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Framework of the proposed relevance feedback system. The initial query for the search target is "a woman in a red dress sitting at a desk".
  • Figure 2: Architecture of the proposed predictive model $f_P^{pred}$. For simplicity, we annotate the time step at the bottom instead of on the symbols.
  • Figure 3: Illustration of the shift in search results. The initial query is "A man is walking down the street with a backpack". The search target is boosted from the 244th rank to the 62nd rank after the second shot in the display panel is selected as feedback. Nevertheless, the ranks of irrelevant images, which are similar to the selected shot but not to the query, are also elevated.
  • Figure 4: Example of LLaVA-NeXT and BLIP2 captions.
  • Figure 5: Performance comparison of different caption sources.
  • ...and 2 more figures