Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

Luozheng Qin; Zhiyu Tan; Mengping Yang; Xiaomeng Yang; Hao Li

TL;DR

The paper tackles two major challenges in video detailed captioning (VDC): imbalanced fine-grained alignment across caption dimensions and misalignment with human preferences. It introduces Cockatiel, a three-stage training pipeline. First, a human-aligned caption quality scorer, trained on annotated data, selects high-quality synthetic captions produced by diverse base models. Second, Cockatiel-13B is trained on the curated data. Third, Cockatiel-8B is distilled from Cockatiel-13B for accessibility. Empirical results on the VDCSCORE benchmark show state-of-the-art, dimension-balanced performance and strong human preference alignment, with ablations validating the effectiveness of the scorer, data sizing, and LoRA-based finetuning. The approach offers a scalable path to high-quality, human-aligned VDC systems and demonstrates practical benefits for training-efficient, detailed video captioning models.

Abstract

Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: capability biased towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset and use it to select synthetic captions that perform well on specific fine-grained video-caption alignment dimensions and are human-preferred, while disregarding the others. Then, we train Cockatiel-13B on this curated dataset to infuse it with the assembled model strengths and human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set a new state of the art on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin, as shown by the human evaluation results.
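
The first-stage selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the `select_captions` function, the score threshold, and the keep-the-best-caption-per-video rule are all assumptions made for illustration, and `toy_scorer` is a trivial stand-in for the learned human-aligned scorer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    video_id: str
    source_model: str  # hypothetical name of the base captioner
    caption: str


def select_captions(
    candidates: List[Candidate],
    scorer: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, Candidate]:
    """Keep, per video, the highest-scoring caption at or above threshold.

    Assumed selection rule for illustration only; the paper's actual
    criterion is not specified in the abstract.
    """
    best: Dict[str, Tuple[float, Candidate]] = {}
    for cand in candidates:
        score = scorer(cand.video_id, cand.caption)
        if score < threshold:
            continue  # discard captions the scorer rates too low
        current = best.get(cand.video_id)
        if current is None or score > current[0]:
            best[cand.video_id] = (score, cand)
    return {vid: cand for vid, (_, cand) in best.items()}


def toy_scorer(video_id: str, caption: str) -> float:
    # Stand-in for the learned scorer: counts words, ignores the video.
    return float(len(caption.split()))


candidates = [
    Candidate("v1", "model_a", "a dog"),
    Candidate("v1", "model_b", "a brown dog runs across a grassy field"),
    Candidate("v2", "model_a", "cat"),
]
selected = select_captions(candidates, toy_scorer, threshold=3.0)
```

With this toy input, only video `v1` keeps a caption (the one from `model_b`); `v2` is dropped because its only candidate falls below the threshold. In a real pipeline the surviving video-caption pairs would form the curated fine-tuning set.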
