Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Yunheng Li; Hengrui Zhang; Meng-Hao Guo; Wenzhao Gao; Shaoyong Jia; Shaohui Jiao; Qibin Hou; Ming-Ming Cheng

TL;DR

Experiments across seven benchmarks show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

Abstract

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

Paper Structure (24 sections, 2 equations, 20 figures, 11 tables)

Introduction
Related Work
Audiovisual Multimodal Models
Datasets for Audiovisual Understanding
ASID-1M and ASID-Verify Pipeline
S1: Multi-Source Annotation Generation
S2: Caption Ensembling and Verification
S3: Attribute-Based Evaluation and Refinement
Analysis of Stage-Wise Contributions
Progressive Attribute Learning
Experiments
Benchmarks
Main Results
Ablation Study
Attribute-level Instruction Following
...and 9 more sections

Figures (20)

Figure 1: Motivation. Existing caption supervision is typically unstructured, leading to incomplete descriptions. In contrast, our attribute-structured supervision aligns and validates each aspect against audiovisual evidence, enabling fine-grained learning.
Figure 2: Overview of ASID-Verify. Multi-source audiovisual annotations are first generated and ensembled with ASR alignment and temporal consistency verification. Captions are then evaluated at the attribute level to identify missing or incorrect content and refined in a targeted manner, producing attribute-structured and quality-verified audiovisual instructions.
Figure 3: Stage-wise analysis of annotation quality and errors under progressively refined training data.
Figure 4: Overview of progressive attribute learning with stage-wise training and controllable attribute selection at inference.
Figure 5: Example of an attribute-structured audiovisual caption generated by ASID-Captioner, with timestamps and grounded speech; color highlights indicate the corresponding attribute groups.
...and 15 more figures
