VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

Weiming Ren; Huan Yang; Jie Min; Cong Wei; Wenhu Chen

TL;DR

VISTA tackles the data bottleneck in long-duration and high-resolution video understanding by introducing a data-centric augmentation pipeline that synthesizes extended-duration and higher-resolution video instruction-following data from existing video-caption sources. The framework generates seven augmentation styles to create VISTA-400K, a large synthetic dataset, and introduces HRVideoBench to specifically evaluate high-resolution understanding. Finetuning multiple video LMMs on VISTA-400K yields consistent gains on long-video benchmarks (avg. +3.3%) and a notable improvement on HRVideoBench (avg. +6.5%), with ablations confirming each augmentation contributes to performance. The work demonstrates the viability of data-centric growth for video-language models, providing open-source data, a new high-resolution benchmark, and insights into how synthetic, multi-faceted video instruction data can enhance both long- and high-resolution video understanding in open-source models.

Abstract

Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective Video Spatiotemporal Augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA spatially and temporally combines videos to create new synthetic videos with extended durations and enhanced resolutions, and subsequently produces question-answer pairs pertaining to these newly synthesized videos. Based on this paradigm, we develop seven video augmentation methods and curate VISTA-400K, a video instruction-following dataset aimed at enhancing long-duration and high-resolution video understanding. Finetuning various video LMMs on our data resulted in an average improvement of 3.3% across four challenging benchmarks for long-video understanding. Furthermore, we introduce the first comprehensive high-resolution video understanding benchmark HRVideoBench, on which our finetuned models achieve a 6.5% performance gain. These results highlight the effectiveness of our framework.
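As a rough illustration of what "spatially and temporally combines videos" can mean, the sketch below shows two generic operations: joining clips end to end to get a longer video, and tiling clips on a grid to get a higher-resolution one. This is a toy sketch under stated assumptions, not the VISTA pipeline: the abstract does not describe the seven augmentation methods in detail, and the function names (`concat_temporal`, `tile_spatial`, `make_clip`) and the nested-list video representation are hypothetical choices made for this example. Question-answer generation is not covered.

```python
from typing import List

Frame = List[List[int]]  # H x W grid of pixel values (toy stand-in for an image)
Video = List[Frame]      # sequence of frames


def concat_temporal(clips: List[Video]) -> Video:
    """Join clips end to end, producing one longer video."""
    return [frame for clip in clips for frame in clip]


def tile_spatial(clips: List[Video], cols: int) -> Video:
    """Place equally sized clips on a grid, producing a larger-frame video."""
    if len(clips) % cols != 0:
        raise ValueError("number of clips must fill the grid exactly")
    n_frames = min(len(clip) for clip in clips)  # truncate to the shortest clip
    out: Video = []
    for t in range(n_frames):
        frame: Frame = []
        for start in range(0, len(clips), cols):
            grid_row = [clip[t] for clip in clips[start:start + cols]]
            # zip(*grid_row) yields matching pixel rows of the side-by-side frames
            for pixel_rows in zip(*grid_row):
                frame.append([p for row in pixel_rows for p in row])
        out.append(frame)
    return out


def make_clip(value: int, n_frames: int = 2, h: int = 2, w: int = 2) -> Video:
    """Constant-valued toy clip (hypothetical helper for this demo)."""
    return [[[value] * w for _ in range(h)] for _ in range(n_frames)]


clips = [make_clip(v) for v in (1, 2, 3, 4)]
long_video = concat_temporal(clips)         # 4 clips x 2 frames, each frame 2x2
hires_video = tile_spatial(clips, cols=2)   # 2 frames, each frame 4x4
```

In this toy setup, the temporal variant keeps the frame size and multiplies the frame count, while the spatial variant keeps the frame count and multiplies the frame size. A real pipeline would operate on decoded video tensors and would need to handle differing clip sizes and frame rates.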
