ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

Ali Athar, Xueqing Deng, Liang-Chieh Chen

TL;DR

ViCaS presents the first dataset to jointly evaluate holistic video understanding and pixel-precise, language-grounded segmentation, providing thousands of videos with detailed captions and temporally consistent grounding masks. It introduces two tasks, Video Captioning and Language-Guided Video Instance Segmentation (LG-VIS), validates its evaluation measures with a user study, and offers Video-LLaVA-Seg as a practical, end-to-end baseline that achieves competitive results. The dataset emphasizes long, descriptive captions with grounding and a large set of groundable objects, enabling robust cross-task learning and evaluation. This work advances practical video understanding by unifying high-level reasoning and fine-grained localization, with potential impact on robotics, video editing, and multimodal AI systems.

Abstract

Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: https://ali2500.github.io/vicas-project/

Paper Structure

This paper contains 22 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: ViCaS Dataset/Benchmark. Our dataset contains detailed video captions with phrase-level grounding for accurate object segmentation masks. The benchmark comprises two tasks to evaluate holistic and pixel-level video understanding, respectively.
  • Figure 2: ViCaS Examples. Our dataset showcases diverse scenes with a variety of objects and video events, along with detailed captions. Phrases referring to multiple objects are written with multiple colors, e.g., "three yellow balls" in row 2 references three different objects.
  • Figure 3:
  • Figure 4:
  • Figure 5:
  • ...and 3 more figures