ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams
Chris Dongjoo Kim, Jihwan Moon, Sangwoo Moon, Heeseung Yun, Sihaeng Lee, Aniruddha Kembhavi, Soonyoung Lee, Gunhee Kim, Sangho Lee, Christopher Clark
TL;DR
ReSpec addresses the challenge of learning from ever-growing video-text streams by filtering data online according to three principled criteria: modality alignment, task relevance, and specificity, all under an efficiency constraint. It precomputes downstream task representations and uses a cross-modal similarity score, a relevance measure based on von Mises-Fisher kernel density estimation (vMF-KDE), and a root-embedding distance to filter data in real time, enabling training on as little as 5% of the data. The approach achieves state-of-the-art zero-shot video retrieval on WebVid2M and VideoCC3M while using far less data than baselines, and it is robust to hyperparameter choices and generalizes across architectures and tasks. The work highlights the practical impact of task-aware online data curation for scalable, responsive multimodal learning in resource-constrained environments.
Abstract
The rapid growth of video-text data presents challenges in storage and computation during training. Online learning, which processes streaming data in real time, offers a promising solution to these issues while also allowing swift adaptation in scenarios that demand real-time responsiveness. One strategy to enhance the efficiency and effectiveness of learning is to identify and prioritize data that improves performance on target downstream tasks. We propose a Relevance and Specificity-based online filtering framework (ReSpec) that selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target-focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. Relevance is determined by the probabilistic alignment of incoming data with downstream tasks, while specificity employs the distance to a root embedding representing the least specific data as an efficient proxy for informativeness. By establishing reference points from target task data, ReSpec filters incoming data in real time, eliminating the need for extensive storage and compute. Evaluated on the large-scale datasets WebVid2M and VideoCC3M, ReSpec attains state-of-the-art performance on five zero-shot video retrieval tasks, using as little as 5% of the data while incurring minimal compute. The source code is available at https://github.com/cdjkim/ReSpec.
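
To make the three filtering criteria concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes precomputed video and text embeddings, scores relevance with an unnormalized vMF kernel density estimate over downstream-task reference embeddings, and measures specificity as the Euclidean distance from a root embedding. All function names, the thresholds (`align_min`, `relevance_min`, `specificity_min`), the concentration `kappa`, and the rule of combining the three tests with a logical AND are illustrative assumptions; the official code at the repository above may differ in every one of these details.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project embeddings onto the unit sphere along the last axis."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def alignment_score(video_emb, text_emb):
    """Cosine similarity between each paired video and text embedding."""
    return np.sum(l2_normalize(video_emb) * l2_normalize(text_emb), axis=-1)


def vmf_kde_log_density(query, refs, kappa=10.0):
    """Unnormalized log vMF-KDE of each query under the reference set.

    The vMF normalizing constant depends only on kappa and the dimension,
    so it is dropped here; it would only shift every score by a constant.
    """
    logits = kappa * l2_normalize(query) @ l2_normalize(refs).T  # (n_query, n_ref)
    m = logits.max(axis=1, keepdims=True)  # stabilize the log-mean-exp
    return (m + np.log(np.mean(np.exp(logits - m), axis=1, keepdims=True))).squeeze(1)


def root_distance(emb, root_emb):
    """Euclidean distance between normalized embeddings and the root embedding."""
    return np.linalg.norm(l2_normalize(emb) - l2_normalize(root_emb), axis=-1)


def online_filter(video_emb, text_emb, task_refs, root_emb,
                  align_min=0.5, relevance_min=5.0, specificity_min=0.5,
                  kappa=10.0):
    """Return a boolean keep-mask for a batch of streamed video-text pairs.

    Thresholds are hypothetical placeholders, not values from the paper.
    """
    aligned = alignment_score(video_emb, text_emb) >= align_min
    relevant = vmf_kde_log_density(text_emb, task_refs, kappa) >= relevance_min
    specific = root_distance(text_emb, root_emb) >= specificity_min
    return aligned & relevant & specific


# Toy 3-D example: one task reference and a root embedding on the diagonal.
task_refs = np.array([[1.0, 0.0, 0.0]])
root_emb = np.array([1.0, 1.0, 1.0])
video = np.array([[1.0, 0.0, 0.0],   # aligned, relevant, specific
                  [0.0, 1.0, 0.0],   # video does not match its text
                  [0.0, 1.0, 0.0],   # aligned but unrelated to the task
                  [1.0, 1.0, 1.0]])  # aligned but identical to the root
text = np.array([[1.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0]])
print(online_filter(video, text, task_refs, root_emb))
```

In this toy batch only the first pair survives: the second fails the alignment test, the third fails the relevance test, and the fourth sits on the root embedding and so fails the specificity test. Because the task references and root embedding are computed once in advance, each incoming sample costs only a few dot products, which is the kind of low-latency behavior the efficiency criterion asks for.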
