VideoCon: Robust Video-Language Alignment via Contrast Captions

Hritik Bansal; Yonatan Bitton; Idan Szpektor; Kai-Wei Chang; Aditya Grover

TL;DR

This work tackles robustness gaps in video-language alignment by introducing VideoCon, a dataset of temporally challenging contrast captions and explanations generated by an LLM, complemented by human-crafted variants. Finetuning a video-language model (Owl-Con) on VideoCon yields substantial gains in both intrinsic entailment/NLE tasks and zero-shot downstream tasks (text-to-video retrieval and video QA), achieving state-of-the-art performance. The findings demonstrate that carefully constructed contrastive data can outperform naive scaling of pretraining data in improving alignment fidelity. VideoCon thus provides a scalable, data-efficient path toward more trustworthy video-language understanding in dynamic scenarios.

Abstract

Despite being (pre)trained on a massive amount of data, state-of-the-art video-language alignment models are not robust to semantically plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments, such as replacing entities or actions and flipping event order, which alignment models should be robust against. To this end, we introduce VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for differences between original and contrast video captions. Then, a generative video-language model is finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models. It exhibits a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Finally, our model sets a new state of the art in zero-shot performance on temporally extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover, our model shows superior performance on novel videos and human-crafted captions and explanations. Our code and data are available at https://github.com/Hritikbansal/videocon.
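
To make the reported AUC gain concrete, here is a minimal sketch of how an AUC score could be computed for an entailment task. It assumes that "AUC" means the area under the ROC curve, that each video-caption pair has a binary label (1 for an aligned caption, 0 for a contrast caption), and that the model outputs a real-valued entailment score. The function name `roc_auc` and the sample numbers are illustrative and not taken from the paper.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison.

    Equals the probability that a randomly chosen aligned pair (label 1)
    receives a higher score than a randomly chosen contrast pair (label 0);
    ties count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative data only: two aligned captions and two contrast captions.
labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
print(roc_auc(labels, scores))  # 0.75: three of the four positive/negative pairs are ranked correctly
```

The pairwise form is quadratic in the number of examples, which is fine for a small illustration; a sort-based implementation is preferable for large evaluation sets.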

Paper Structure (43 sections, 1 equation, 19 figures, 7 tables)

Introduction
Video Language Alignment
Video-Language Entailment (VLE)
Natural Language Explanation (NLE)
VideoCon: Contrast Captions Generation for Robust Video-Language Alignment
Temporally-Challenging Instance Selection
Categories of Contrast Captions
Data Generation using an LLM
Data Generation using Humans
Experimental Setup
Finetuning with VideoCon
VideoCon Evaluation Metrics
Video-Text Downstream Tasks
Baselines
Experiments
...and 28 more sections

Figures (19)

Figure 1: Overview of our VideoCon approach. First, aligned video-language pairs are filtered to retain temporally-challenging instances. Then contrast captions and natural language explanations (NLE) are generated by an LLM to create the VideoCon dataset. Second, a video-language alignment model is finetuned with VideoCon on the alignment and NLE tasks. Finally, the finetuned model is evaluated against the baseline model. Our results show that it outperforms the baseline, achieving SOTA results on downstream tasks.
Figure 2: Overview of the VideoCon data generation process from top to bottom. Specifically, we prompt a large language model (PaLM-2) with the original caption that is grounded in the video, and the intended type of misalignment within the contrast caption. We consider seven kinds of misalignments including object, action, attribute, counting, spatial relation, hallucination, and event order flip. We provide a generated contrast caption and the corresponding natural language explanation for each misalignment type.
Figure 3: Distribution of the types of misalignments within the contrast captions of the VideoCon dataset. We observe that the dataset has good representation for all the kinds of misalignments ranging from $8.8\%$ to $24.2\%$.
Figure 4: Entailment task prompt for finetuning.
Figure 5: NLE generation task prompt for finetuning.
...and 14 more figures
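
The generation step described in the Figure 2 caption (prompting an LLM with the original caption and the intended misalignment type) can be sketched as follows. The seven type names come from that caption; the function `build_contrast_prompt` and the prompt wording are hypothetical and are not the prompts used by the authors. No LLM call is made here, since the sketch only assembles the input text.

```python
# The seven misalignment types named in the Figure 2 caption.
MISALIGNMENT_TYPES = (
    "object",
    "action",
    "attribute",
    "counting",
    "spatial relation",
    "hallucination",
    "event order flip",
)


def build_contrast_prompt(caption: str, misalignment: str) -> str:
    """Assemble a prompt asking for a contrast caption and an explanation.

    Hypothetical template: the wording below is an assumption for
    illustration, not the prompt from the paper.
    """
    if misalignment not in MISALIGNMENT_TYPES:
        raise ValueError(f"unknown misalignment type: {misalignment!r}")
    return (
        f"Original caption: {caption}\n"
        f"Misalignment type: {misalignment}\n"
        "Write a plausible contrast caption that differs from the original "
        "only in this respect, then explain the difference in one sentence."
    )


# One prompt per misalignment type for a single (made-up) caption.
prompts = [build_contrast_prompt("A man opens a door.", t) for t in MISALIGNMENT_TYPES]
```

Keeping the type list in one tuple means the validation and any per-type statistics, such as the distribution in Figure 3, can share a single source of truth.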
