VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, Lianwen Jin
TL;DR
VideoCLIP-XL addresses the limited ability of video CLIP models to understand long textual descriptions. It introduces an automatic data collection pipeline that produces VILD, a large-scale dataset of video and long-description pairs; Text-similarity-guided Primary Component Matching (TPCM), which adapts representation learning to long descriptions; and two description-ranking tasks, Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), that encourage faithful long-text understanding. It also introduces the Long Video Description Ranking (LVDR) benchmark for evaluating long-description ranking, and reports state-of-the-art performance on standard text-video retrieval benchmarks and on LVDR. The work indicates that combining large-scale long-description data with adaptive feature selection and targeted ranking tasks yields substantial gains in long-text video understanding.
Abstract
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, which often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. First, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset of VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding the long-description capability. We also introduce two new tasks, Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), to further improve understanding. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate the long-description capability more comprehensively. Extensive experimental results on widely used text-video retrieval benchmarks with both short and long descriptions, and on our LVDR benchmark, demonstrate the effectiveness of our method.
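The abstract does not specify how description ranking is scored, so the following is only an illustrative sketch of one way a ranking check of this kind could look. It assumes a video embedding and a list of description embeddings that are already ordered from most to least faithful, and it measures the fraction of adjacent pairs in which the more faithful description has the higher cosine similarity to the video. The function names `cosine` and `ranking_accuracy`, the toy 2-D embeddings, and the pairwise metric itself are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_accuracy(video_emb, desc_embs):
    """Fraction of adjacent description pairs ranked in the expected order.

    Assumption (hypothetical): desc_embs is ordered from the most faithful
    description to the least faithful one, so similarities to the video
    should be strictly decreasing along the list.
    """
    sims = [cosine(video_emb, d) for d in desc_embs]
    pairs = list(zip(sims, sims[1:]))
    correct = sum(1 for better, worse in pairs if better > worse)
    return correct / len(pairs)


# Toy 2-D embeddings, invented purely for illustration.
video = [1.0, 0.0]
descs = [[1.0, 0.0], [1.0, 0.5], [1.0, 2.0]]  # progressively less aligned

score = ranking_accuracy(video, descs)
reversed_score = ranking_accuracy(video, descs[::-1])
```

In this toy setup, descriptions that drift further from the video embedding receive lower similarity, so the correctly ordered list scores higher than the reversed one. A real evaluation would replace the hand-written vectors with outputs of the video and text encoders.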
