VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Dahun Kim, AJ Piergiovanni, Ganesh Mallya, Anelia Angelova

TL;DR

VideoComp is a benchmark and learning framework for video-text compositionality understanding that targets fine-grained temporal alignment in vision-language models (VLMs). It pairs the benchmark with a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning.

Abstract

We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
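
To make the hierarchical pairwise preference loss described above more concrete, the following is a minimal sketch rather than the paper's exact formulation. It assumes the model yields one video-text similarity score per candidate text, that the scores are listed from the positive text to the most disrupted negative, and that each pair is penalized with a logistic (softplus) ranking term. The function names and the `margin` parameter are hypothetical.

```python
import math


def pairwise_preference_loss(score_preferred, score_other, margin=0.0):
    """Logistic ranking term: small when the preferred score is clearly higher.

    Assumption: a softplus over the negated score gap; the paper may use a
    different pairwise form.
    """
    x = -(score_preferred - score_other - margin)
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def hierarchical_preference_loss(scores_by_level, margin=0.0):
    """Average pairwise loss over all (less disrupted, more disrupted) pairs.

    scores_by_level[0] is the score for the positive text, and
    scores_by_level[k] is the score for a text with k disruptions
    (hypothetical ordering convention).
    """
    total, pairs = 0.0, 0
    for i in range(len(scores_by_level)):
        for j in range(i + 1, len(scores_by_level)):
            total += pairwise_preference_loss(
                scores_by_level[i], scores_by_level[j], margin
            )
            pairs += 1
    return total / pairs if pairs else 0.0


# Scores that decrease with disruption level incur a lower loss than
# scores that increase with it.
well_ordered = hierarchical_preference_loss([0.9, 0.6, 0.3])
mis_ordered = hierarchical_preference_loss([0.3, 0.6, 0.9])
print(well_ordered < mis_ordered)
```

Comparing every level against every more disrupted level, rather than only the positive against each negative, is one way to express the "gradually penalizes" behavior: a singly disrupted text is still preferred over a doubly disrupted one.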

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Illustration of our video-text compositionality benchmark, introducing challenging disruptions for fine-grained alignment. Starting from a video and its positive text description (top row), we apply subtle disruptions including temporal reordering (purple), action word replacement (orange), and segment-level mismatch (yellow text matched with blue video crop) for evaluation, along with combined disruptions used during training. These perturbations test the model’s ability to distinguish coherent video-text pairs from disrupted ones.
  • Figure 2: Overview of the dataset construction process. We start from a list of dense, multi-event captions of each video to obtain the positive text, then generate various negative texts with compositional disruptions.
  • Figure 3: CompPretrain strategy simulates long-form video-text sequences by concatenating short video-text pairs (1)-(4). This enables the use of temporally disrupted sequences and composition-aware learning in video pretraining with the simulated data.
  • Figure 4: Comparison of VidCLIP-base and VidCLIP-final on our benchmark dataset, showing their confidence scores in selecting the positive text (A) over various negative samples (B, C, D, E). The negatives cover temporal reordering (purple), action replacement (orange), segment mismatch (text in yellow box matched with video crops in blue), and multiple disruptions. VidCLIP-final consistently achieves higher scores across the different compositional disruptions.
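
The construction steps summarized in Figures 2 and 3 can be illustrated with a small sketch. It is an assumption-laden toy, not the authors' pipeline: event captions are plain strings in temporal order, the action swap table is hand-written, partial captioning is modeled as keeping only the first few events, and the clip identifiers are placeholders. All function names are hypothetical.

```python
import random


def reorder_events(captions, rng):
    """Temporal reordering: swap two event captions at distinct positions."""
    if len(captions) < 2:
        return list(captions)
    i, j = rng.sample(range(len(captions)), 2)
    out = list(captions)
    out[i], out[j] = out[j], out[i]
    return out


def replace_action(captions, swaps):
    """Action word replacement: swap the first word found in `swaps`.

    `swaps` is a hypothetical hand-written table of action words.
    """
    out = list(captions)
    for idx, caption in enumerate(out):
        words = caption.split()
        for w_idx, word in enumerate(words):
            if word in swaps:
                words[w_idx] = swaps[word]
                out[idx] = " ".join(words)
                return out
    return out


def partial_caption(captions, keep):
    """Partial captioning (assumed form): text covers only the first events."""
    return list(captions[:keep])


def concat_pairs(pairs):
    """Simulate a multi-event sequence from short (clip, caption) pairs."""
    clips = [clip for clip, _ in pairs]
    captions = [caption for _, caption in pairs]
    return clips, captions


# Placeholder clips and captions standing in for short video-caption pairs.
pairs = [
    ("clip_1", "a man picks up a ball"),
    ("clip_2", "he throws the ball"),
    ("clip_3", "a dog catches it"),
]
clips, positive = concat_pairs(pairs)
rng = random.Random(0)
negatives = {
    "reorder": reorder_events(positive, rng),
    "action": replace_action(positive, {"throws": "drops"}),
    "partial": partial_caption(positive, 2),
}
for name, text in negatives.items():
    print(name, "|", " ".join(text))
```

Combined disruptions, as used during training, could be obtained under the same assumptions by chaining these functions, for example applying `replace_action` to the output of `reorder_events`.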