AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan

TL;DR

This work tackles audiovisual video captioning by emphasizing temporal alignment between visual and audio cues, a capability largely missing in vision-centric captioning. It introduces AVoCaDO, a multimodal captioner built on interleaved token sequences (based on Qwen2.5-Omni-7B) and enhanced through a two-stage post-training pipeline: SFT on a curated 107K audiovisual-caption dataset and GRPO with specialized rewards to improve temporal coherence and dialogue fidelity while controlling caption length. The approach combines a data-driven SFT stage with a reinforcement-learning stage that uses a checklist-based reward, a dialogue-based reward, and a length-regularization term to holistically optimize caption quality. Extensive experiments show that AVoCaDO outperforms open-source audiovisual captioning models across four benchmarks and remains competitive on visual-only tasks like VDC Detailed and DREAM-1K, with ablations validating the contribution of each component. The work also provides an open-source release to foster further research in robust, temporally aware multimodal video understanding and generation.

Abstract

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.
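
As a rough illustration of how the GRPO stage could combine its three reward signals (checklist-based, dialogue-based, and length regularization), the sketch below sums them into one scalar per sampled caption and then normalizes within a group of samples, as GRPO-style methods commonly do. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the equal weights, the token budget, the linear length penalty, and the use of the population standard deviation are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def length_reward(n_tokens, target=1024, tolerance=256):
    """Hypothetical length regularizer (shape and constants are assumptions).

    Returns 1.0 while the caption stays within `target` tokens, then decays
    linearly to 0.0 once it exceeds the target by `tolerance` tokens.
    """
    overflow = max(0, n_tokens - target)
    return max(0.0, 1.0 - overflow / tolerance)


def total_reward(r_checklist, r_dialogue, n_tokens, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three rewards; the weights are assumed, not sourced."""
    w_c, w_d, w_l = weights
    return w_c * r_checklist + w_d * r_dialogue + w_l * length_reward(n_tokens)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled captions.

    Each advantage is (reward - group mean) / (group std + eps), so captions
    are scored relative to the other samples for the same video.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical captions sampled for the same video:
    # (checklist score, dialogue score, caption length in tokens)
    samples = [(0.8, 0.9, 900), (0.6, 0.4, 1100), (0.9, 0.7, 1500)]
    rewards = [total_reward(c, d, n) for c, d, n in samples]
    print(group_relative_advantages(rewards))
```

In this sketch the third caption scores well on the checklist but loses its entire length reward for overshooting the assumed budget, which is the kind of trade-off a length-regularization term is meant to expose.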

Paper Structure

This paper contains 35 sections, 10 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Schematic illustration of the pilot experiment. In this example, naively concatenating captions from the video and audio modalities fails to yield a correct answer to the corresponding question. In contrast, jointly processing both modalities to generate a time-aligned caption provides sufficient information, as indicated by the underlined text.
  • Figure 2: The pipeline for generating high-quality temporally-aligned audiovisual video captions. For clarity, corresponding audio-visual events before and after fusion are marked with circled numbers and underlined for reference.
  • Figure 3: Illustration of the three rewards $\mathcal{R}_C$, $\mathcal{R}_D$, and $\mathcal{R}_L$, which are specifically designed for improving the quality of audiovisual video captioning.
  • Figure 4: An illustration of a video caption generated by AVoCaDO, featuring both precise audiovisual temporal alignment and accurate dialogue rendering.
  • Figure 5: Distribution of caption token lengths across video durations.
  • ...and 11 more figures