VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

TL;DR

This work tackles omni-modality video understanding by introducing the VAST-27M dataset and the VAST foundation model, which jointly model vision, audio, subtitles, and text. A two-stage data generation pipeline uses vision and audio captioners plus an LLM to produce rich omni-modality captions, enabling robust cross-modality learning for retrieval, captioning, and QA. Empirical results show VAST achieves numerous state-of-the-art results across vision-text, audio-text, and multi-modal video-text benchmarks with high efficiency and broad task coverage. The work also analyzes open-source corpora vs. VAST-27M, demonstrates the benefits of LLM-driven omni-captioning, and discusses biases and limitations, outlining directions for larger and more diverse omni-modality data.

Abstract

Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we set out to establish connections between the multi-modality video tracks (Vision, Audio, and Subtitle) and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
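
The two-stage caption generation described above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the `Clip` type, the function names, the prompt layout, and the default instruction text are all hypothetical, and the captioners and the LLM are passed in as plain callables so that no particular model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    """Hypothetical container for one video clip."""
    clip_id: str
    subtitle: str  # raw subtitle text; may be empty


def build_omni_prompt(vision_caps: List[str], audio_caps: List[str],
                      subtitle: str, instruction: str) -> str:
    """Combine single-modality captions, subtitle, and instruction into one prompt.

    The layout here is an assumption made for illustration only.
    """
    parts = [
        instruction,
        "Vision captions: " + "; ".join(vision_caps),
        "Audio captions: " + "; ".join(audio_caps),
    ]
    if subtitle.strip():  # skip the subtitle line when the clip has no subtitle
        parts.append("Subtitle: " + subtitle.strip())
    return "\n".join(parts)


def generate_omni_caption(
    clip: Clip,
    vision_captioner: Callable[[Clip], List[str]],
    audio_captioner: Callable[[Clip], List[str]],
    llm: Callable[[str], str],
    instruction: str = "Merge the following clues into one video caption.",
) -> str:
    """Stage 1: caption each modality separately. Stage 2: let an LLM integrate them."""
    vision_caps = vision_captioner(clip)
    audio_caps = audio_captioner(clip)
    prompt = build_omni_prompt(vision_caps, audio_caps, clip.subtitle, instruction)
    return llm(prompt)
```

Keeping the captioners and the LLM as injected callables means the same skeleton can be exercised with simple stubs before any trained captioner or hosted LLM is plugged in.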

Paper Structure (26 sections, 5 equations, 5 figures, 17 tables)

This paper contains 26 sections, 5 equations, 5 figures, 17 tables.

Figures (5)

  • Figure 1: Illustration of the difference between conventional cross-modality pretraining and the proposed omni-modality pretraining. Thanks to the proposed omni-modality video caption corpus VAST-27M, the VAST foundation model can perceive videos from multiple information sources, including vision, audio, and subtitles, and enhance the connections between omni-modality videos (OMV) and omni-modality captions (OMC) through large-scale pretraining. A, V, S, and T represent audio, vision, subtitle, and text, respectively. AC, VC, and AVC are abbreviations for audio, vision, and audiovisual captions.
  • Figure 2: Illustration of the caption generation process of VAST-27M (top) and training framework of VAST (bottom). The vision and audio captioners generate captions based on the input video clip, and the Omni-Modality Captioner (Vicuna-13b) integrates them along with the raw subtitle and instructional prompts, to generate the omni-modality caption. The VAST model consists of three encoders and is trained under three objectives including OM-VCC, OM-VCM, and OM-VCG.
  • Figure 3: Word cloud map (Top-200) for vision, audio, omni-modality captions and raw subtitles of VAST-27M.
  • Figure 4: Ablation study for instructional prompt used for omni-modality video caption generation in VAST-27M.
  • Figure 5: More samples in VAST-27M.