VADB: A Large-Scale Video Aesthetic Database with Professional and Multi-Dimensional Annotations

Qianqian Qiao, DanDan Zheng, Yihang Bo, Bao Peng, Heng Huang, Longteng Jiang, Huaye Wang, Jingdong Chen, Jun Zhou, Xin Jin

TL;DR

This work addresses the lack of large-scale, richly annotated video aesthetics data by introducing VADB, the largest video aesthetic database, with 10,490 videos annotated by 37 professionals across 11 score dimensions plus language comments and tags. It also presents VADB-Net, a two-stage framework that first pre-trains a CLIP-based video encoder with dual text inputs (comments and tags) and a dynamic fusion mechanism, then fine-tunes a regression head for aesthetic scoring. The dataset offers multi-dimensional annotations produced under rigorous quality control, and the model outperforms existing video quality assessment baselines on scoring tasks while supporting downstream aesthetic tasks. The dataset and code are released openly for reproducibility.

Abstract

Video aesthetic assessment, a vital area in multimedia computing, integrates computer vision with human cognition. Its progress is limited by the lack of standardized datasets and robust models, as the temporal dynamics of video and multimodal fusion challenges hinder direct application of image-based methods. This study introduces VADB, the largest video aesthetic database with 10,490 diverse videos annotated by 37 professionals across multiple aesthetic dimensions, including overall and attribute-specific aesthetic scores, rich language comments and objective tags. We propose VADB-Net, a dual-modal pre-training framework with a two-stage training strategy, which outperforms existing video quality assessment models in scoring tasks and supports downstream video aesthetic assessment tasks. The dataset and source code are available at https://github.com/BestiVictory/VADB.

Paper Structure

This paper contains 42 sections, 2 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 2: Annotation guidelines and example videos for aesthetic scoring of character videos: scores 1-3 show significant technical and aesthetic flaws; 4-5 meet basic standards with evident weaknesses; 6-7 meet standards with average execution; 8-9 exhibit technical skill and artistic merit; 10 reflects exceptional technical and artistic integration.
  • Figure 3: Scoring criteria and example videos for the aesthetic attributes of natural scenery videos. Additionally, we provided similarly detailed scoring criteria for the aesthetic attributes of character, food, and architecture videos. Example videos were selected by the expert team to illustrate the characteristics of each score range, with at least three videos of different scenarios for each range, enabling annotators to intuitively understand the evaluation criteria.
  • Figure 4: Annotators label only the Tag Category, guided by a standardized document with explanations and example videos.
  • Figure 5: Composition and qualifications of the annotation team
  • Figure 6: Histogram of overall and attribute score distributions, showing a dense mid-range and sparse extremes, consistent with typical rating patterns.
  • ...and 6 more figures
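
The score bands listed in the Figure 2 caption can be expressed as a small lookup. The following is a minimal sketch, not code from the VADB repository: the function name `score_band`, the band wording, and the restriction to integer scores from 1 to 10 are assumptions made for illustration.

```python
from bisect import bisect_left

# Inclusive upper bound of each score band, taken from the Figure 2 caption
# (1-3, 4-5, 6-7, 8-9, 10).
_BAND_UPPER_BOUNDS = [3, 5, 7, 9, 10]

# Paraphrased band descriptions; the wording here is illustrative.
_BAND_DESCRIPTIONS = [
    "significant technical and aesthetic flaws",
    "meets basic standards with evident weaknesses",
    "meets standards with average execution",
    "exhibits technical skill and artistic merit",
    "exceptional technical and artistic integration",
]


def score_band(score: int) -> str:
    """Return the rubric description for an integer score in 1..10.

    Hypothetical helper: integer-only scores are an assumption, since the
    caption does not say how fractional scores are handled.
    """
    if not isinstance(score, int) or not 1 <= score <= 10:
        raise ValueError("score must be an integer between 1 and 10")
    # bisect_left finds the first band whose upper bound is >= score.
    return _BAND_DESCRIPTIONS[bisect_left(_BAND_UPPER_BOUNDS, score)]
```

A lookup of this kind could be used, for example, to attach a human-readable band to each annotated score when inspecting the score distributions.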