
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data

Wufei Ma, Kai Li, Zhongshi Jiang, Moustafa Meshry, Qihao Liu, Huiyu Wang, Christian Häne, Alan Yuille

TL;DR

This work questions whether current video-text foundation models truly understand dynamic videos, arguing that many standard benchmarks admit shortcuts. It introduces a new evaluation task, Retrieval from Counterfactually Augmented Data (RCAD), and a new dataset, Feint6K, in which negative captions describe plausibly altered actions within the same visual context, so that picking the correct caption requires cross-frame reasoning. The authors identify shortcut learning as a key limitation of existing contrastive objectives and propose LLM-teacher, which uses knowledge from a pretrained LLM to learn action semantics more effectively. LLM-teacher significantly boosts RCAD performance across multiple models while leaving standard retrieval performance nearly unchanged. Humans remain substantially ahead on RCAD, which exposes a gap and gives a clear target for future multi-modal learning and evaluation. Overall, RCAD and LLM-teacher offer a concrete benchmark and a practical path toward deeper action understanding in video-text models, with implications for both evaluation and training.

Abstract

Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the contents of natural videos? Standard video-text evaluations could be misleading as many questions can be inferred merely from the objects and contexts in a single frame or biases inherent in the datasets. In this paper, we aim to better assess the capabilities of current video-text models and understand their limitations. We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset. To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented data and are far behind human-level performance. In order to narrow the gap between video-text models and human performance on RCAD, we identify a key limitation of current contrastive approaches on video-text data and introduce LLM-teacher, a more effective approach to learn action semantics by leveraging knowledge obtained from a pretrained large language model. Experiments and analyses show that our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models. Our Feint6K dataset and project page are available at https://feint6k.github.io.
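
The RCAD setup described above can be illustrated with a minimal sketch: each video is scored against its true caption and a set of counterfactual negatives, and the video counts as correct only if the true caption scores highest. The function names (`cosine`, `rcad_accuracy`), the data layout, and the toy embeddings are assumptions made for illustration. They are not taken from the paper or its released code, and real embeddings would come from a video-text model.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rcad_accuracy(video_embs, caption_embs, positive_idx=0):
    """Fraction of videos whose top-scoring candidate is the true caption.

    video_embs: one embedding per video.
    caption_embs: for each video, a list of candidate caption embeddings.
    Hypothetical layout: the true caption sits at positive_idx and the
    remaining candidates are counterfactual negatives.
    """
    correct = 0
    for video, candidates in zip(video_embs, caption_embs):
        scores = [cosine(video, c) for c in candidates]
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == positive_idx)
    return correct / len(video_embs)


# Toy 2-D embeddings, invented purely for illustration.
videos = [[1.0, 0.0], [0.0, 1.0]]
captions = [
    [[0.9, 0.1], [0.1, 0.9]],  # true caption (index 0) scores highest
    [[1.0, 0.2], [0.8, 0.6]],  # a negative (index 1) scores highest
]
print(rcad_accuracy(videos, captions))  # 0.5
```

In this toy case the second video is "fooled" by a negative caption, which is the failure mode the task is designed to expose.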

Paper Structure

This paper contains 44 sections, 2 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: (a): Despite large-scale pretraining on web-scale data, current video-text models such as InternVideo (wang2022internvideo) can be easily fooled by counterfactually augmented data. (b): The performance of InternVideo on retrieval from counterfactually augmented data (RCAD) drops by over 30% compared to standard video-to-text retrieval and by 38.6% compared to human-level performance. We also evaluate models on standard video-text retrieval from only 6 sampled candidates and show that our RCAD task is indeed much more challenging. Our LLM-teacher successfully improves the performance on RCAD by enforcing a more effective learning of action semantics.
  • Figure 2: Different evaluations of video-text understanding. (a): In standard video-to-text retrieval, negative captions are sampled from different videos in the same dataset. Image-text models can achieve good performance by exploiting shortcuts (e.g., "football" and "cymbals") and biases (e.g., spurious correlation between "outdoor" and "football"). (b): In our proposed RCAD paradigm, we adopt a human-in-the-loop system (see \ref{sec:data_human}) to obtain "hard" negative captions with unchanged object entities but modified actions. Models must develop a holistic understanding of the semantics from the sequence of frames to retrieve the matched caption.
  • Figure 3: Examples of RCAD on our Feint6K dataset.
  • Figure 4: Overview of our data collection pipeline for Feint6K dataset featuring a human-in-the-loop system.
  • Figure 5: Change of cosine similarity w.r.t. objects or actions. (a): Comparison between the change in cosine similarity when the action or object is swapped. Results show that current video-text models learn a more effective embedding for objects than for actions. (b): Comparison between the change in cosine similarity using InternVideo or our LLM-teacher. This demonstrates that LLM-teacher learns a more discriminative embedding for actions by enabling a more effective contrastive learning using knowledge from LLMs. Refer to \ref{sec:main_results} for more details.
  • ...and 6 more figures
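
The quantity plotted in Figure 5, the change in cosine similarity when the action or the object in a caption is swapped, can be sketched as follows. This is a minimal sketch under stated assumptions: the helper `similarity_drop` and all embedding values are hypothetical and chosen only so the toy result mirrors the trend the caption reports (a large drop for object swaps, a small one for action swaps). They are not measurements from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_drop(video, original_caption, edited_caption):
    """How much video-caption similarity falls after the caption is edited."""
    return cosine(video, original_caption) - cosine(video, edited_caption)


# Toy 3-D embeddings, invented for illustration only.
video = [1.0, 1.0, 0.0]
caption = [1.0, 1.0, 0.0]          # matched caption
object_swapped = [0.0, 1.0, 1.0]   # embedding moves a lot
action_swapped = [1.0, 0.9, 0.0]   # embedding barely moves

object_drop = similarity_drop(video, caption, object_swapped)
action_drop = similarity_drop(video, caption, action_swapped)

# A model with weak action semantics shows a much smaller drop for
# action swaps than for object swaps.
print(object_drop > action_drop)  # True
```

A more discriminative action embedding would show up here as a larger `action_drop`, since the edited caption would move further away from the video.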