Interpretable Embedding for Ad-hoc Video Search

Jiaxin Wu, Chong-Wah Ngo

TL;DR

This paper tackles the lack of interpretability in ad-hoc video search (AVS) by proposing a dual-task network that jointly learns a cross-modal embedding for retrieval and a multi-label concept decoder for interpretation. The framework shares a visual encoder between tasks and employs a class-sensitive BCE loss to robustly decode concepts, enabling the embedding-based results to be explained via decoded concepts and refined via query concepts. Empirically, it achieves state-of-the-art TRECVid results, with substantial gains from late fusion of embedding and concept-based searches, and demonstrates interpretability and Boolean-query handling to support interactive search workflows. The work highlights the complementary strengths of concept-based and concept-free approaches, showing how interpretability can be infused into black-box embedding methods to improve retrieval and user understanding in AVS.

Abstract

Answering queries with semantic concepts has long been the mainstream approach to video search. Only recently has its performance been surpassed by the concept-free approach, which embeds queries in the same joint space as videos. Nevertheless, the embedded features and the search results are not interpretable, which hinders subsequent steps in video browsing and query reformulation. This paper integrates feature embedding and concept interpretation into a single neural network for unified dual-task learning. In this way, an embedding is associated with a list of semantic concepts that serves as an interpretation of the video content. The paper empirically demonstrates that considerable search improvement is attainable on TRECVid benchmark datasets using either the embedding features or the concepts. Concepts are not only effective in pruning false-positive videos but also highly complementary to concept-free search, yielding a large margin of improvement over state-of-the-art approaches.

Paper Structure

This paper contains 13 sections, 13 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: An overview of the end-to-end dual-task network architecture.
  • Figure 2: The AVS performance comparison between the normal BCE loss and our proposed BCE loss.
  • Figure 3: Sensitivity of hyper-parameter $\theta$ in the late fusion of embedding-only and concept-only searches.
  • Figure 4: Visualization showing (a) the improvement of $DT_{combined}$ over $DT_{embedding}$ and (b) composition of true positives in $DT_{combined}$ within the search depth of 1,000. The x-axis shows the query topic ID. Each column in (b) visualizes the true positives that a method can retrieve with a different color.
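
The late fusion referred to in Figures 3 and 4 combines the embedding-only and concept-only searches under a hyper-parameter $\theta$. The summary above does not give the fusion formula, so the following is only a minimal sketch under stated assumptions: a linear weighting in which $\theta$ multiplies the embedding score and $1 - \theta$ multiplies the concept score, with both scores already normalized to a comparable range. The function name `late_fusion` and the dictionary-based score format are hypothetical and chosen for illustration only.

```python
def late_fusion(embedding_scores, concept_scores, theta):
    """Rank videos by a weighted sum of two per-video score dictionaries.

    Assumption (not confirmed by the source): fusion is linear, with
    theta weighting the embedding score and (1 - theta) the concept score.
    Both inputs are assumed to be normalized to a comparable range.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")

    # A video missing from one search contributes a score of 0 for it.
    videos = set(embedding_scores) | set(concept_scores)
    fused = {
        video: theta * embedding_scores.get(video, 0.0)
        + (1.0 - theta) * concept_scores.get(video, 0.0)
        for video in videos
    }

    # Highest fused score first; ties are broken by video ID for determinism.
    return sorted(fused, key=lambda video: (-fused[video], video))


# Hypothetical scores for three videos, for illustration only.
embedding = {"a": 0.9, "b": 0.2, "c": 0.5}
concept = {"a": 0.1, "b": 0.8, "c": 0.5}
ranking = late_fusion(embedding, concept, theta=0.7)
```

Under this assumption, $\theta = 1$ reduces to the embedding-only search and $\theta = 0$ to the concept-only search, which is the range a sensitivity study such as Figure 3 would sweep.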