AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

Xinlong Chen; Yue Ding; Weihong Lin; Jingyun Hua; Linli Yao; Yang Shi; Bozhou Li; Yuanxing Zhang; Qiang Liu; Pengfei Wan; Liang Wang; Tieniu Tan

TL;DR

This work tackles audiovisual video captioning with an emphasis on temporal alignment between visual and audio cues, a capability largely missing from vision-centric captioning. It introduces AVoCaDO, a multimodal captioner based on Qwen2.5-Omni-7B that operates on interleaved token sequences and is refined through a two-stage post-training pipeline: supervised fine-tuning (SFT) on a curated dataset of 107K audiovisual captions, followed by Group Relative Policy Optimization (GRPO) with specialized rewards that improve temporal coherence and dialogue fidelity while controlling caption length. The GRPO stage combines a checklist-based reward, a dialogue-based reward, and a length-regularization term. Experiments show that AVoCaDO outperforms open-source audiovisual captioning models on four benchmarks and remains competitive on visual-only tasks such as VDC Detailed and DREAM-1K, with ablations validating the contribution of each component. The work also provides an open-source release to support further research in temporally aware multimodal video understanding and generation.

Abstract

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.
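
To make the GRPO stage more concrete, the following is a minimal sketch of how several reward terms could be merged into one scalar per caption and then turned into group-relative advantages. It is not the paper's implementation: this section does not specify the reward definitions, so the function names, the additive combination, the weight `w_len`, the token `budget`, and the linear length penalty are all assumptions made for illustration. Only the group-relative normalization (subtract the group mean, divide by the group standard deviation) follows the general GRPO recipe, and the use of the population standard deviation here is also an assumption.

```python
from statistics import mean, pstdev


def length_penalty(num_tokens: int, budget: int) -> float:
    """Hypothetical length term: 0 within budget, linearly negative beyond it, floored at -1."""
    if num_tokens <= budget:
        return 0.0
    return -min(1.0, (num_tokens - budget) / budget)


def combined_reward(checklist_score: float, dialogue_score: float,
                    num_tokens: int, budget: int = 512, w_len: float = 0.5) -> float:
    """Assumed additive mix of the three reward terms named in the abstract.

    checklist_score and dialogue_score are taken as already computed
    values in [0, 1]; how they are scored is not covered in this section.
    """
    return checklist_score + dialogue_score + w_len * length_penalty(num_tokens, budget)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of captions sampled for the same video."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical captions sampled for one video:
    # (checklist score, dialogue score, caption length in tokens)
    samples = [(0.8, 0.6, 300), (0.5, 0.9, 768), (0.9, 0.9, 2000)]
    rewards = [combined_reward(c, d, n) for c, d, n in samples]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.3f} advantage={adv:+.3f}")
```

Because advantages are computed relative to the other captions in the same group, a caption is only reinforced when it scores better than its siblings; if every caption in a group receives the same reward, all advantages are zero and the group contributes no learning signal.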
