TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm

Bingqing Zhang, Zhuo Cao, Heming Du, Xin Yu, Xue Li, Jiajun Liu, Sen Wang

TL;DR

TokenBinder is a two-stage Text-Video Retrieval (TVR) framework built on a one-to-many coarse-to-fine alignment paradigm, which imitates how humans identify a specific item within a large collection by comparing candidates directly. The authors report that it substantially outperforms existing state-of-the-art methods on six benchmark datasets.

Abstract

Text-Video Retrieval (TVR) methods typically match query-candidate pairs by aligning text and video features in coarse-grained, fine-grained, or combined (coarse-to-fine) manners. However, these frameworks predominantly employ a one(query)-to-one(candidate) alignment paradigm, which struggles to discern nuanced differences among candidates, leading to frequent mismatches. Inspired by Comparative Judgement in human cognitive science, where decisions are made by directly comparing items rather than evaluating them independently, we propose TokenBinder. This innovative two-stage TVR framework introduces a novel one-to-many coarse-to-fine alignment paradigm, imitating the human cognitive process of identifying specific items within a large collection. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism, dynamically aligning and comparing features across multiple videos to capture finer nuances and contextual variations. Extensive experiments on six benchmark datasets confirm that TokenBinder substantially outperforms existing state-of-the-art methods. These results demonstrate its robustness and the effectiveness of its fine-grained alignment in bridging intra- and inter-modality information gaps in TVR tasks.

Paper Structure (17 sections, 8 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: An illustration of alignment methods in text-video retrieval, categorized into three types: coarse-grained (a), fine-grained (b), and coarse-to-fine grained alignment (c). Traditional methods typically employ a one-to-one alignment paradigm. In contrast, we introduce a one-to-many coarse-to-fine grained alignment approach, allowing each query to be compared with multiple video candidates (d). This method facilitates mining differences among candidates to achieve enhanced retrieval effectiveness.
  • Figure 2: TokenBinder Framework for Text-Video Retrieval. The diagram shows the complete workflow of our two-stage retrieval system. First, the query is processed using intra-modality cross attention to bind significant query indicators with textual features, shown in the green section. The broad-view retrieval then ranks video candidates based on their global features using cosine similarity and contrastive loss, illustrated in the middle section. The top-ranked candidates are then re-ranked in the focused-view retrieval through inter-modality cross attention and MLP-based similarity scoring, as depicted in the top section. This process ensures comprehensive text-video alignment and optimizes retrieval accuracy.
  • Figure 3: Example of Text-to-Video Retrieval produced by TokenBinder and CLIP-ViP. A green tick signifies successful retrieval of the accurate video, while a red cross denotes an erroneous retrieval outcome.
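
The two-stage workflow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`broad_view`, `focused_view`), the toy features, and the softmax temperature are all hypothetical. In particular, the token-to-frame max-similarity score is only a stand-in for the inter-modality cross attention and MLP-based scoring, whose details are not given in this summary. The softmax across the shortlist is there to show the one-to-many idea: each candidate's final score depends on the other candidates it is compared against.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def broad_view(query_global, video_globals, k):
    """Stage 1: rank every video by cosine similarity of global features
    and keep the indices of the top-k candidates."""
    scores = [cosine(query_global, v) for v in video_globals]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def focused_view(query_tokens, shortlist, video_frames, temperature=0.1):
    """Stage 2 (assumed stand-in): score each shortlisted video with a
    fine-grained token-to-frame similarity, then normalise the scores
    jointly across the shortlist so candidates are compared to each other."""
    raw = []
    for idx in shortlist:
        frames = video_frames[idx]
        # For each query token, take its best-matching frame.
        per_token = [max(cosine(t, f) for f in frames) for t in query_tokens]
        raw.append(sum(per_token) / len(per_token))

    # Softmax over the candidates (one query, many videos).
    top = max(raw)
    exps = [math.exp((r - top) / temperature) for r in raw]
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(shortlist)), key=lambda j: probs[j], reverse=True)
    return [shortlist[j] for j in order], probs


# Toy data (hypothetical): 4 videos, 2-dimensional features.
query_global = [1.0, 0.0]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
video_globals = [[1.0, 0.1], [1.0, 0.2], [0.0, 1.0], [-1.0, 0.0]]
video_frames = [
    [[1.0, 0.0], [1.0, 0.0]],   # video 0 matches only the first query token
    [[1.0, 0.0], [0.0, 1.0]],   # video 1 matches both query tokens
    [[0.0, 1.0], [0.0, 1.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
]

shortlist = broad_view(query_global, video_globals, k=2)              # [0, 1]
reranked, probs = focused_view(query_tokens, shortlist, video_frames)  # reranked == [1, 0]
```

In this toy case the global features slightly favour video 0, but the fine-grained comparison within the shortlist promotes video 1, which covers both query tokens. Restricting the second stage to the top-k candidates keeps the more expensive comparison off the full collection.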