ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das

TL;DR

ViBe introduces a large-scale, human-annotated benchmark for text-to-video hallucinations, defining five categories and providing 3,782 videos generated from 837 MS COCO prompts across 10 open-source T2V models. It establishes a practical classification benchmark using TimeSFormer and VideoMAE embeddings, with TimeSFormer + CNN achieving the best baseline performance at 0.345 accuracy and 0.342 F1, highlighting both progress and the difficulty of automated detection. The dataset surpasses prior work (e.g., T2VHaluBench) by a substantial margin, enabling robust evaluation of fidelity and prompt adherence, and supporting future improvements through longer videos, expanded taxonomy, and RLHF-based alignment. These contributions offer a valuable resource for researchers to assess, detect, and mitigate hallucinations in text-to-video models, driving the development of more reliable T2V systems and user-aligned outputs.

Abstract

Recent advances in Large Multimodal Models (LMMs) have expanded their capabilities to video understanding, with Text-to-Video (T2V) models excelling in generating videos from textual prompts. However, these models still frequently produce hallucinated content, revealing AI-generated inconsistencies. We introduce ViBe (https://vibe-t2v-bench.github.io/): a large-scale dataset of hallucinated videos from open-source T2V models. We identify five major hallucination types: Vanishing Subject, Omission Error, Numeric Variability, Subject Dysmorphia, and Visual Incongruity. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 diverse MS COCO captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings. ViBe serves as a critical resource for evaluating T2V reliability and advancing hallucination detection. We establish classification as a baseline, with the TimeSFormer + CNN ensemble achieving the best performance (0.345 accuracy, 0.342 F1 score). The modest accuracy of these initial baselines highlights the difficulty of automated hallucination detection and the need for improved methods. Our research aims to drive the development of more robust T2V models and to evaluate their outputs based on user preferences.
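
To make the reported metrics concrete, the following is a minimal sketch of how accuracy and F1 could be computed over the five hallucination categories. The averaging scheme is an assumption: the abstract does not say whether the F1 score is macro-, micro-, or weighted-averaged, so macro averaging is used here for illustration. The helper names and the toy labels are hypothetical and are not taken from the ViBe dataset.

```python
from collections import Counter

# The five hallucination categories named in the abstract.
CATEGORIES = [
    "Vanishing Subject",
    "Omission Error",
    "Numeric Variability",
    "Subject Dysmorphia",
    "Visual Incongruity",
]


def accuracy(y_true, y_pred):
    """Fraction of videos whose predicted category matches the human label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=CATEGORIES):
    """Unweighted mean of per-category F1 (assumed averaging scheme)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A category with no true or predicted instances scores 0.0 here.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only (not ViBe annotations).
y_true = ["Vanishing Subject", "Vanishing Subject", "Omission Error",
          "Numeric Variability", "Subject Dysmorphia", "Visual Incongruity"]
y_pred = ["Vanishing Subject", "Omission Error", "Omission Error",
          "Numeric Variability", "Visual Incongruity", "Visual Incongruity"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # accuracy = 0.667
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")  # macro F1 = 0.600
```

For scale, a classifier that guessed uniformly at random among five categories would be right about 20% of the time if the classes were balanced. The class distribution is not given here, so that figure is only a rough reference point for the reported 0.345 accuracy.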

Paper Structure

This paper contains 20 sections, 1 equation, 21 figures, 6 tables.

Figures (21)

  • Figure 1: To generate the videos, we utilized randomly sampled image captions from the MS COCO dataset as textual inputs for the video generation models. The resulting videos were then manually annotated by human annotators to construct the ViBe dataset. Following annotation, the videos were processed into feature-rich video embeddings using advanced embedding techniques. These embeddings along with human annotated hallucination labels were subsequently input into various classifier models, which were trained to identify and categorize different types of video hallucinations, enabling the detection of discrepancies between the expected and generated content.
  • Figure 2: Prompt: three guys are standing on a beach next to surfboards. Vanishing Subject: The prompt mentions that there are three guys on a beach with surfboards. In the initial frame, we see 3 guys on the beach with surfboards, but in the last frame, we find only two guys remaining. The third guy seems to have vanished.
  • Figure 3: Hierarchy of hallucination categories in ViBe.
  • Figure 4: Prompt: Two road workers are standing by a red light with a sign. Numeric Variability: The prompt explicitly mentions two road workers. However, while the system accurately incorporates elements like the red light and depicts one road worker standing, it fails to generate the second road worker as specified in the prompt. The system modifies the specified number of subjects, decreasing their count, which deviates from the original instructions.
  • Figure 5: Prompt: A train heading for a curve in the track. Visual Incongruity: The scenario presents multiple logical and physical impossibilities in its temporal sequence. Initially, no train is visible in the first two frames, violating conservation of mass and the principle of object permanence. In the third frame, the train suddenly materializes on the track without a clear point of origin. In the final frame, the train inexplicably rotates to become perpendicular to the track, an action that defies both the mechanical constraints of train wheels on rails and basic laws of motion. This instantaneous 90-degree rotation would be physically impossible given a train's fixed wheel assembly and its momentum-governed movement along rails.
  • ...and 16 more figures
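
The pipeline described in the Figure 1 caption (prompt, generated video, human label, video embedding, classifier) can be sketched in a few lines. This is only an illustration under stated assumptions: the two-dimensional embeddings are made-up toy values standing in for TimeSFormer or VideoMAE features, and a nearest-centroid rule is swapped in for the paper's classifier models, which are not reproduced here. The `AnnotatedVideo` record and both helper functions are hypothetical names.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class AnnotatedVideo:
    prompt: str       # MS COCO caption used as the T2V prompt
    embedding: tuple  # toy stand-in for a TimeSFormer/VideoMAE feature vector
    label: str        # human-annotated hallucination category


def fit_centroids(videos):
    """Average the embeddings of each category (nearest-centroid training)."""
    sums, counts = {}, {}
    for v in videos:
        acc = sums.setdefault(v.label, [0.0] * len(v.embedding))
        for i, x in enumerate(v.embedding):
            acc[i] += x
        counts[v.label] = counts.get(v.label, 0) + 1
    return {lab: tuple(x / counts[lab] for x in acc) for lab, acc in sums.items()}


def predict(centroids, embedding):
    """Return the category whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: dist(centroids[lab], embedding))


# Prompts come from the figure captions; the embeddings are invented toy values.
train = [
    AnnotatedVideo("three guys are standing on a beach next to surfboards",
                   (0.9, 0.1), "Vanishing Subject"),
    AnnotatedVideo("three guys are standing on a beach next to surfboards",
                   (0.8, 0.2), "Vanishing Subject"),
    AnnotatedVideo("Two road workers are standing by a red light with a sign",
                   (0.1, 0.9), "Numeric Variability"),
    AnnotatedVideo("Two road workers are standing by a red light with a sign",
                   (0.2, 0.8), "Numeric Variability"),
]

centroids = fit_centroids(train)
print(predict(centroids, (0.7, 0.3)))  # Vanishing Subject
```

A nearest-centroid rule is used only because it fits in a few lines of standard-library Python. In practice, the embedding step would call a pretrained video encoder and the classifier would be a trained model.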