Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Nina Shvetsova, Arsha Nagrani, Bernt Schiele, Hilde Kuehne, Christian Rupprecht
TL;DR
The paper tackles representation biases in video benchmarks by introducing Unbiasing through Textual Descriptions (UTD), which uses Vision-Language Models (VLMs) to generate frame-level textual descriptions and Large Language Models (LLMs) to extract objects, activities, and verbs from them. The authors analyze three kinds of bias (concept bias such as object bias, temporal bias, and common sense vs. dataset bias) across 12 datasets and 30 models, and construct UTD-descriptions and UTD-splits to debias test sets without modifying video frames. They show that object bias dominates many benchmarks and that debiased splits yield a more robust assessment of video understanding, while larger backbones show varying sensitivity to bias. The contributions are UTD-descriptions, UTD-splits, and a comprehensive benchmark analysis intended to support less biased video benchmarks and more robust video-understanding models. The paper also discusses limitations stemming from hallucinations in VLMs and biases in embedding models, and presents UTD as a practical, scalable approach for future benchmarking work.
Abstract
We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects), and leverage them to examine representation biases across three dimensions: 1) concept bias: determining if a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias: assessing if temporal information contributes to prediction; and 3) common sense vs. dataset bias: evaluating whether zero-shot reasoning or dataset correlations contribute to prediction. We conduct a systematic analysis of 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 30 state-of-the-art video models on original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release: "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
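The idea of an object-debiased test split can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words similarity stands in for whatever text embedding model is actually used, the removal rule (drop every test sample whose label is predictable from object words alone) is an assumption, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; an assumed stand-in for a real text embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_from_objects(object_text, class_names):
    """Pick the class whose name is most similar to the object-only description."""
    vec = bow(object_text)
    return max(class_names, key=lambda name: cosine(vec, bow(name)))


def object_only_accuracy(samples, class_names):
    """Fraction of samples classified correctly from object words alone."""
    hits = sum(predict_from_objects(text, class_names) == label
               for _, text, label in samples)
    return hits / len(samples)


def object_debiased_split(samples, class_names):
    """Keep only the samples that object words alone do NOT classify correctly."""
    return [video_id for video_id, text, label in samples
            if predict_from_objects(text, class_names) != label]


# Toy data (hypothetical): (video id, object-only description, ground-truth label)
class_names = ["playing guitar", "doing yoga", "playing basketball"]
samples = [
    ("v1", "guitar microphone stage", "playing guitar"),
    ("v2", "person room floor", "doing yoga"),
    ("v3", "basketball hoop court", "playing basketball"),
]

accuracy = object_only_accuracy(samples, class_names)
kept = object_debiased_split(samples, class_names)
```

In this toy example, v1 and v3 are recognizable from their objects and are dropped, while v2 stays in the debiased split. A high object-only accuracy on a real test set would indicate strong object bias in the sense described in the abstract.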
