VideoCon: Robust Video-Language Alignment via Contrast Captions

Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, Aditya Grover

TL;DR

This work tackles robustness gaps in video-language alignment by introducing VideoCon, a dataset of temporally challenging contrast captions and explanations generated by an LLM, complemented by human-crafted variants. Finetuning a video-language model (Owl-Con) on VideoCon yields substantial gains in both intrinsic entailment/NLE tasks and zero-shot downstream tasks (text-to-video retrieval and video QA), achieving state-of-the-art performance. The findings demonstrate that carefully constructed contrastive data can outperform naive scaling of pretraining data in improving alignment fidelity. VideoCon thus provides a scalable, data-efficient path toward more trustworthy video-language understanding in dynamic scenarios.

Abstract

Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments, such as replacing entities or actions and flipping event order, which alignment models should be robust against. To this end, we introduce VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for the differences between original and contrast video captions. Then, a generative video-language model is finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models. It exhibits a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Finally, our model sets a new state of the art in zero-shot performance on temporally-extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover, our model shows superior performance on novel videos and human-crafted captions and explanations. Our code and data are available at https://github.com/Hritikbansal/videocon.
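
To make the AUC figure above concrete, the following is a minimal sketch of how an area-under-the-ROC-curve score could be computed for an alignment model. It assumes the reported AUC is ROC-AUC over binary labels (1 for an aligned caption, 0 for a contrast caption) and that the model emits one scalar entailment score per video-caption pair. The function name `roc_auc` and the example scores are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC-AUC via pairwise comparison.

    Equals the probability that a randomly chosen aligned pair (label 1)
    receives a higher score than a randomly chosen contrast pair (label 0).
    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one aligned and one contrast example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical entailment scores for two aligned and two contrast captions.
example_scores = [0.9, 0.2, 0.8, 0.1]
example_labels = [1, 1, 0, 0]
example_auc = roc_auc(example_scores, example_labels)
```

In this illustrative example, three of the four aligned-versus-contrast comparisons favor the aligned caption, so the score is 0.75. The quadratic loop keeps the definition visible; a rank-based implementation would be preferable for large evaluation sets.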

Paper Structure (43 sections, 1 equation, 19 figures, 7 tables)

This paper contains 43 sections, 1 equation, 19 figures, 7 tables.

Figures (19)

  • Figure 1: Overview of our VideoCon approach. First, aligned video-language pairs are filtered to retain temporally-challenging instances. Then contrast captions and natural language explanations (NLE) are generated by an LLM to create the VideoCon dataset. Second, a video-language alignment model is finetuned with VideoCon on the alignment and NLE tasks. Finally, the finetuned model is evaluated against the baseline model. Our results show that it outperforms the baseline, achieving SOTA results on downstream tasks.
  • Figure 2: Overview of the VideoCon data generation process from top to bottom. Specifically, we prompt a large language model (PaLM-2) with the original caption that is grounded in the video, and the intended type of misalignment within the contrast caption. We consider seven kinds of misalignments including object, action, attribute, counting, spatial relation, hallucination, and event order flip. We provide a generated contrast caption and the corresponding natural language explanation for each misalignment type.
  • Figure 3: Distribution of the types of misalignments within the contrast captions of the VideoCon dataset. We observe that the dataset has good representation for all the kinds of misalignments ranging from $8.8\%$ to $24.2\%$.
  • Figure 4: Entailment task prompt for finetuning.
  • Figure 5: NLE generation task prompt for finetuning.
  • ...and 14 more figures
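
The generation step described in the Figure 2 caption (an LLM is given the original caption plus an intended misalignment type) can be sketched roughly as follows. The seven misalignment types come from that caption; the prompt wording, the function name `build_contrast_prompt`, and the example caption are hypothetical illustrations, not the authors' actual prompt or code. No LLM is called here; the sketch only assembles the text that would be sent.

```python
# The seven misalignment types listed in the Figure 2 caption.
MISALIGNMENT_TYPES = (
    "object",
    "action",
    "attribute",
    "counting",
    "spatial relation",
    "hallucination",
    "event order flip",
)


def build_contrast_prompt(caption: str, misalignment: str) -> str:
    """Assemble a prompt asking for a contrast caption and an explanation.

    The template below is an assumed placeholder, not the paper's prompt.
    """
    if misalignment not in MISALIGNMENT_TYPES:
        raise ValueError(f"unknown misalignment type: {misalignment!r}")
    return (
        f"Original caption: {caption}\n"
        f"Misalignment type: {misalignment}\n"
        "Write a plausible contrast caption that differs from the original "
        "only in this respect, then explain the difference in one sentence."
    )


# Hypothetical caption, used only for illustration.
example_prompt = build_contrast_prompt(
    "A man pours water into a glass and then drinks it.",
    "event order flip",
)
```

Validating the type against a fixed tuple keeps the generated data restricted to the categories the caption enumerates, which would make a per-type breakdown like the one in Figure 3 straightforward to compute.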