Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

Nina Shvetsova, Arsha Nagrani, Bernt Schiele, Hilde Kuehne, Christian Rupprecht

TL;DR

The paper tackles representation biases in video benchmarks by introducing Unbiasing through Textual Descriptions (UTD), which uses Vision-Language Models to generate frame-level textual descriptions and Large Language Models to extract objects, activities, and verbs. By evaluating object-, temporal-, and common sense vs. dataset bias across 12 datasets and 30 models, the authors construct UTD-descriptions and UTD-splits to debias test sets without modifying video frames. They show that object bias dominates many benchmarks and that debiased splits yield a more robust assessment of true video understanding, with larger backbones displaying varying sensitivity to bias. The work contributes UTD-descriptions, UTD-splits, and a comprehensive benchmark analysis that supports building less biased video benchmarks and more robust video-understanding models. It also discusses limitations related to hallucinations in VLMs and biases in embedding models, offering a practical, scalable path for future benchmarking work.

Abstract

We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects) and leverage them to examine representation biases across three dimensions: 1) concept bias - determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias - assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias - evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
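To make the idea of object bias concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy bag-of-words vector in place of a real text embedding model, hypothetical class names and object-only descriptions, and a simple filtering rule (drop any test sample whose label can be predicted zero-shot from its object description alone). The actual UTD pipeline and its selection criteria may differ.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a text embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_predict(description, class_names):
    """Pick the class whose name is most similar to the description.

    Ties (including all-zero similarity) resolve to the first class listed.
    """
    desc_vec = embed(description)
    return max(class_names, key=lambda c: cosine(desc_vec, embed(c)))


def object_debiased_split(samples, class_names):
    """Keep only samples whose label is NOT recoverable from objects alone.

    Assumed rule for illustration; not necessarily the paper's criterion.
    """
    return [
        s for s in samples
        if zero_shot_predict(s["objects"], class_names) != s["label"]
    ]


# Hypothetical class names and object-only descriptions of three test videos.
CLASS_NAMES = ["playing guitar", "riding horse", "falling down"]
SAMPLES = [
    {"id": "v1", "objects": "guitar microphone stage", "label": "playing guitar"},
    {"id": "v2", "objects": "horse saddle field", "label": "riding horse"},
    {"id": "v3", "objects": "person room floor", "label": "falling down"},
]

if __name__ == "__main__":
    kept = object_debiased_split(SAMPLES, CLASS_NAMES)
    print([s["id"] for s in kept])
```

In this toy setting, the first two videos are identifiable from their objects and are filtered out, while the third is kept because its objects say nothing about the activity. Note that the video frames themselves are never modified; only the composition of the test split changes.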

Paper Structure

This paper contains 38 sections, 2 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Can you guess the activity in a video based solely on the objects or a single frame? (The answer is on the next page in Figure 2.)
  • Figure 2: The answer to Figure 1. Some videos may exhibit an object representation bias (allowing predictions based solely on objects) or single-frame representation bias (solely on a single frame), while others require more information for prediction.
  • Figure 3: The proposed UTD method involves generating textual descriptions of different concepts in video frames using VLMs and LLMs, combining them in various temporal configurations, and evaluating the performance of these concept-temporal representations with strong text embedding models. For each representation, we distinguish between common sense bias, which relies on zero-shot reasoning by text embedding models, and dataset bias, assessed using a linear model trained on the dataset’s training set.
  • Figure A.1: Comparison of class distribution in full test/val split vs. UTD-debiased split vs. UTD-debiased-balanced split for six considered classification datasets.
  • Figure F.1: Qualitative examples of objects+composition+activities textual descriptions for random videos in the MSRVTT dataset.
  • ...and 5 more figures