Interpretable Embedding for Ad-hoc Video Search
Jiaxin Wu, Chong-Wah Ngo
TL;DR
This paper tackles the lack of interpretability in ad-hoc video search (AVS) by proposing a dual-task network that jointly learns a cross-modal embedding for retrieval and a multi-label concept decoder for interpretation. The framework shares a visual encoder between tasks and employs a class-sensitive BCE loss to robustly decode concepts, enabling the embedding-based results to be explained via decoded concepts and refined via query concepts. Empirically, it achieves state-of-the-art TRECVid results, with substantial gains from late fusion of embedding and concept-based searches, and demonstrates interpretability and Boolean-query handling to support interactive search workflows. The work highlights the complementary strengths of concept-based and concept-free approaches, showing how interpretability can be infused into black-box embedding methods to improve retrieval and user understanding in AVS.
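To make the idea of a class-sensitive BCE loss more concrete, here is a minimal sketch in plain Python. The summary above does not give the loss formula, so the form used here is an assumption rather than the authors' definition: the binary cross-entropy is averaged separately over a video's positive and negative concepts, and the two averages are then mixed. This keeps the few positive labels from being swamped by the many negatives, which is the usual motivation for such a loss. The function name `class_sensitive_bce` and the parameter `pos_weight` are hypothetical.

```python
import math


def class_sensitive_bce(probs, labels, pos_weight=0.5, eps=1e-7):
    """Illustrative class-sensitive BCE for one video (assumed form, not the paper's).

    probs      -- predicted concept probabilities in [0, 1]
    labels     -- ground-truth concept labels (1 = present, 0 = absent)
    pos_weight -- hypothetical mixing weight given to the positive-class term
    """
    pos_terms = []
    neg_terms = []
    for p, y in zip(probs, labels):
        # Clamp to avoid log(0) on saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            pos_terms.append(-math.log(p))
        else:
            neg_terms.append(-math.log(1.0 - p))

    # Average each class separately so that the (typically far more numerous)
    # negative concepts cannot dominate the loss.
    pos_loss = sum(pos_terms) / len(pos_terms) if pos_terms else 0.0
    neg_loss = sum(neg_terms) / len(neg_terms) if neg_terms else 0.0
    return pos_weight * pos_loss + (1.0 - pos_weight) * neg_loss
```

With one positive and three negative concepts, a plain BCE average would give the negatives three quarters of the total weight; under this sketch each class contributes according to `pos_weight` regardless of how many labels it has.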
Abstract
Answering queries with semantic concepts has long been the mainstream approach for video search. Only recently has its performance been surpassed by the concept-free approach, which embeds queries in the same joint space as videos. Nevertheless, neither the embedded features nor the search results are interpretable, hindering subsequent steps in video browsing and query reformulation. This paper integrates feature embedding and concept interpretation into a single neural network for unified dual-task learning. In this way, each embedding is associated with a list of semantic concepts that serves as an interpretation of the video content. The paper empirically demonstrates that, using either the embedding features or the concepts, considerable search improvement is attainable on TRECVid benchmark datasets. Concepts are not only effective in pruning false-positive videos but also highly complementary to concept-free search, leading to a large margin of improvement over state-of-the-art approaches.
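The complementarity between concept-based and concept-free search is exploited through late fusion of the two result lists. The following is a minimal sketch of one common way to do this, assuming min-max normalization of each score list followed by a weighted average. The specific fusion rule, the function names `min_max_normalize` and `late_fusion`, and the `weight` parameter are illustrative assumptions, not details taken from the paper.

```python
def min_max_normalize(scores):
    """Rescale a {video_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: no ranking information to preserve.
        return {video: 0.0 for video in scores}
    return {video: (s - lo) / (hi - lo) for video, s in scores.items()}


def late_fusion(embedding_scores, concept_scores, weight=0.5):
    """Rank videos by a weighted average of two normalized score lists.

    embedding_scores -- {video_id: similarity} from the concept-free search
    concept_scores   -- {video_id: score} from the concept-based search
    weight           -- hypothetical weight on the embedding scores
    """
    emb = min_max_normalize(embedding_scores)
    con = min_max_normalize(concept_scores)
    # A video missing from one list is treated as scoring 0 in that list.
    videos = set(emb) | set(con)
    fused = {
        v: weight * emb.get(v, 0.0) + (1.0 - weight) * con.get(v, 0.0)
        for v in videos
    }
    # Sort by descending fused score; break ties by video id for determinism.
    return sorted(fused, key=lambda v: (-fused[v], v))


ranking = late_fusion(
    {"v1": 0.9, "v2": 0.8, "v3": 0.1},
    {"v1": 0.2, "v2": 0.9, "v3": 0.1},
)
```

In this toy example, `v1` tops the embedding list but scores poorly on the query concepts, so fusion promotes `v2`, which both searches rate highly. This mirrors the abstract's point that concepts help prune false positives from the embedding-based results.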
