VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen

TL;DR

VISTA tackles the data bottleneck in long-duration and high-resolution video understanding by introducing a data-centric augmentation pipeline that synthesizes extended-duration and higher-resolution video instruction-following data from existing video-caption sources. The framework uses seven augmentation methods to create VISTA-400K, a large synthetic dataset, and introduces HRVideoBench to specifically evaluate high-resolution understanding. Finetuning multiple video LMMs on VISTA-400K yields consistent gains on long-video benchmarks (avg. +3.3%) and a notable improvement on HRVideoBench (avg. +6.5%), with ablations confirming each augmentation contributes to performance. The work demonstrates the viability of data-centric growth for video-language models, providing open-source data, a new high-resolution benchmark, and insights into how synthetic, multi-faceted video instruction data can enhance both long- and high-resolution video understanding in open-source models.

Abstract

Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.
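
To make "spatially and temporally combines videos" concrete, here is a minimal sketch of the two basic operations, not the authors' pipeline. It assumes clips are nested lists of grayscale pixel values (standing in for real frame arrays to keep the example dependency-free) and that all clips in a grid share the same frame count and frame size. The function names `concat_temporal` and `tile_spatial` are hypothetical, and the seven actual augmentation methods are not reproduced here.

```python
from typing import List

Frame = List[List[int]]  # H rows, each holding W pixel values (grayscale)
Clip = List[Frame]       # T frames


def concat_temporal(clips: List[Clip]) -> Clip:
    """Play the clips back to back, yielding one longer clip."""
    return [frame for clip in clips for frame in clip]


def tile_spatial(clips: List[Clip], rows: int, cols: int) -> Clip:
    """Place rows*cols clips on a grid, yielding one higher-resolution clip.

    Assumes every clip has the same number of frames and the same frame size.
    Clips fill the grid in row-major order.
    """
    if len(clips) != rows * cols:
        raise ValueError("expected exactly rows * cols clips")
    num_frames = len(clips[0])
    tiled: Clip = []
    for t in range(num_frames):
        frame: Frame = []
        for r in range(rows):
            row_clips = clips[r * cols:(r + 1) * cols]
            height = len(row_clips[0][t])
            for y in range(height):
                # Join pixel row y of each clip in this grid row, left to right.
                frame.append([px for clip in row_clips for px in clip[t][y]])
        tiled.append(frame)
    return tiled


def solid_clip(value: int, frames: int, height: int, width: int) -> Clip:
    """Build a toy clip whose every pixel equals `value`."""
    return [[[value] * width for _ in range(height)] for _ in range(frames)]
```

With four toy clips of 2 frames at 3x4 pixels, `concat_temporal` returns 8 frames at the original size, while `tile_spatial(clips, 2, 2)` returns 2 frames at 6x8 pixels. A question-answer pair would then be written against the combined video, for example asking about content that appears in only one segment or one grid cell.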

Paper Structure

This paper contains 30 sections, 1 equation, 4 figures, 7 tables.

Figures (4)

  • Figure 1: VISTA is a simple but effective framework that generates high-quality video instruction data from existing video-caption pairs. Our VISTA-400K dataset enhances model performance on various long and high-resolution video benchmarks.
  • Figure 2: Our proposed video augmentation and instruction-following data synthesis schemes for VISTA-400K. Given input videos, we perform spatiotemporal video combinations to produce augmented video samples with longer duration and higher resolution.
  • Figure 3: Qualitative comparisons between the baseline models and our VISTA-finetuned models. Red text indicates hallucinations or incorrect responses, while green text highlights the correct responses that correspond accurately to the video content.
  • Figure 4: Example questions from our HRVideoBench. Zoom in for better visualizations.