ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams

Chris Dongjoo Kim, Jihwan Moon, Sangwoo Moon, Heeseung Yun, Sihaeng Lee, Aniruddha Kembhavi, Soonyoung Lee, Gunhee Kim, Sangho Lee, Christopher Clark

TL;DR

ReSpec addresses the challenge of learning from ever-growing video-text streams by filtering data online according to three criteria: modality alignment, task relevance, and specificity, all under an efficiency constraint. It precomputes downstream task representations and uses a cross-modal similarity score, a vMF-KDE-based relevance measure, and a root-embedding distance to filter data in real time, enabling training with substantially less data. The approach achieves state-of-the-art zero-shot video retrieval on WebVid2M and VideoCC3M while using as little as 5% of the data, and demonstrates robustness to hyperparameters and broad generalization across architectures and tasks. The work highlights the practical impact of task-aware online data curation for scalable, responsive multimodal learning in resource-constrained environments.

Abstract

The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real time, offers a promising solution to these issues while also allowing swift adaptation in scenarios demanding real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning involves identifying and prioritizing data that enhances performance on target downstream tasks. We propose ReSpec, a Relevance and Specificity-based online filtering framework that selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target-focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding representing the least specific data as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real time, eliminating the need for extensive storage and compute. Evaluated on the large-scale datasets WebVid2M and VideoCC3M, ReSpec attains state-of-the-art performance on five zero-shot video retrieval tasks, using as little as 5% of the data while incurring minimal compute. The source code is available at https://github.com/cdjkim/ReSpec.
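The three data-quality criteria in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the thresholds `align_thr` and `rel_thr`, the concentration `kappa`, and the rule of comparing against the nearest downstream reference are all assumptions made for illustration. Relevance is scored with an unnormalized von Mises-Fisher kernel density estimate over precomputed downstream text embeddings, and specificity compares distances to a root embedding.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def vmf_kde_log_density(x, refs, kappa):
    """Unnormalized log-density of unit vector x under a vMF kernel density estimate.

    The vMF normalizing constant is dropped: for a fixed kappa and dimension it
    is the same for every sample, so it does not change threshold comparisons.
    """
    scores = [kappa * dot(x, r) for r in refs]
    m = max(scores)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores)) - math.log(len(refs))

def accept(video_emb, text_emb, task_refs, root_emb, *, align_thr, rel_thr, kappa):
    """Return True if a video-text pair passes all three filters (hypothetical rule set)."""
    v, t = normalize(video_emb), normalize(text_emb)
    refs = [normalize(r) for r in task_refs]
    root = normalize(root_emb)

    # (i) Modality alignment: cosine similarity between video and text embeddings.
    if dot(v, t) < align_thr:
        return False

    # (ii) Task relevance: density of the text embedding under the downstream KDE.
    if vmf_kde_log_density(t, refs, kappa) < rel_thr:
        return False

    # (iii) Specificity (assumed rule): the sample must be at least as far from
    # the root embedding as its nearest downstream reference is.
    nearest = max(refs, key=lambda r: dot(t, r))
    return l2(t, root) >= l2(nearest, root)
```

Because the downstream references and the root embedding are computed once in advance, each incoming pair costs only a few dot products, which is consistent with the low-latency goal stated in criterion (iv).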

Paper Structure

This paper contains 35 sections, 6 equations, 16 figures, 13 tables.

Figures (16)

  • Figure 1: Online training with online filtering. Comparison of average performance across various methods for online filtering, shown as a function of the cumulative data samples used for training. Our proposed method consistently outperforms baseline approaches at any number of training iterations on WebVid2M [bain21webvid], achieving the highest performance with minimal data requirements.
  • Figure 2: High-level comparison of offline and online filtered training. (a) Offline filtering approaches initially process the stored massive-scale data (often by scoring and ranking the entire data) and retain the filtered subset. The filtered data is then used during the training phase. (b) Online filtering performs real-time filtering, dynamically deciding whether to accept or discard samples. Accepted samples are immediately forwarded for online model training. Downstream task-aware online filtering, such as CiT [xu2023cit], CoLoR-Filter [brandfonbrener2024color-filter], and our ReSpec, utilizes downstream task data (e.g., embeddings) to guide the filtering process.
  • Figure 3: ReSpec: Relevance and Specificity based online filtering. (a) We precompute downstream embeddings by utilizing the target downstream task dataset. (b) Relevance is determined by evaluating how closely a data point aligns with the target downstream embedding distribution using density estimation. (c) Specificity is measured by comparing the relative distances of the incoming embedding and the downstream embedding to the root embedding (empty text embedding, i.e., " "). Video-text pairs that pass all of the alignment, relevance, and specificity filters are accepted for online model training.
  • Figure 4: Performance comparison. We compare our approach to the baselines based on the average performance and the ratio of filtered data size to full data size. The average performance is the average of Recall at 1, 5, and 10 across the five downstream tasks. Training on full datasets without filtering achieves an average performance of 47.23 on WebVid2M and 42.62 on VideoCC3M.
  • Figure 5: Multimodal alignment threshold analysis on WebVid2M. Our ReSpec filtering achieves the best overall performance across different video-text cosine similarity threshold values.
  • ...and 11 more figures
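
The average performance metric described in the Figure 4 caption (the mean of Recall at 1, 5, and 10 across the downstream tasks) can be sketched as follows. This is an illustrative computation only: the function names are hypothetical, and it assumes a single retrieval direction in which query i is paired with candidate i in each task's similarity matrix, which may differ from the paper's exact evaluation protocol.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose paired item (same index) ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of candidates scored strictly higher.
        rank = sum(1 for s in row if s > row[i])
        hits += rank < k
    return 100.0 * hits / len(sim)

def average_performance(task_sims, ks=(1, 5, 10)):
    """Mean over tasks of the mean Recall@k over the given cutoffs."""
    per_task = [sum(recall_at_k(sim, k) for k in ks) / len(ks) for sim in task_sims]
    return sum(per_task) / len(per_task)
```

With five downstream tasks and three cutoffs, the result is the plain mean of fifteen recall values, which is the kind of number the caption reports for training on the full datasets.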