Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Yunheng Li, Hengrui Zhang, Meng-Hao Guo, Wenzhao Gao, Shaoyong Jia, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng

TL;DR

Experiments across seven benchmarks show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

Abstract

Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforce semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

Paper Structure

This paper contains 24 sections, 2 equations, 20 figures, and 11 tables.

Figures (20)

  • Figure 1: Motivation. Existing caption supervision is typically unstructured, leading to incomplete descriptions. In contrast, our attribute-structured supervision aligns and validates each aspect against audiovisual evidence, enabling fine-grained learning.
  • Figure 2: Overview of ASID-Verify. Multi-source audiovisual annotations are first generated and ensembled with ASR alignment and temporal consistency verification. Captions are then evaluated at the attribute level to identify missing or incorrect content and refined in a targeted manner, producing attribute-structured and quality-verified audiovisual instructions.
  • Figure 3: Stage-wise analysis of annotation quality and errors under progressively refined training data.
  • Figure 4: Overview of progressive attribute learning with stage-wise training and controllable attribute selection at inference.
  • Figure 5: Example of an attribute-structured audiovisual caption generated by ASID-Captioner, with timestamps and grounded speech; color highlights indicate the corresponding attribute groups.
  • ...and 15 more figures