Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

Luozheng Qin, Zhiyu Tan, Mengping Yang, Xiaomeng Yang, Hao Li

TL;DR

The paper tackles two major challenges in video detailed captioning (VDC): imbalanced fine-grained alignment across caption dimensions and misalignment with human preferences. It introduces Cockatiel, a three-stage training pipeline that ensembles synthetic captions from diverse base models and uses a human-aligned caption quality scorer to curate high-quality training data, followed by training Cockatiel-13B and distilling Cockatiel-8B for accessibility. The scorer is trained on human-annotated data and used to filter the synthetic captions, enabling dimension-balanced and human-preferred VDC outputs. Empirical results on the VDCSCORE benchmark show state-of-the-art, dimension-balanced performance and strong human preference alignment, with ablations validating the effectiveness of the scorer, the training dataset size, and LoRA-based finetuning. The approach offers a scalable path to high-quality, human-aligned VDC systems and demonstrates practical benefits for training-efficient, detailed video captioning models.

Abstract

Video Detailed Captioning (VDC) is a crucial task for vision-language bridging, enabling fine-grained descriptions of complex video content. In this paper, we first comprehensively benchmark current state-of-the-art approaches and systematically identify two critical limitations: capability biased towards specific captioning aspects and misalignment with human preferences. To address these deficiencies, we propose Cockatiel, a novel three-stage training pipeline that ensembles synthetic and human-aligned training to improve VDC performance. In the first stage, we derive a scorer from a meticulously annotated dataset to select synthetic captions that perform well on certain fine-grained video-caption alignment dimensions and are human-preferred, while disregarding the others. Then, we train Cockatiel-13B on this curated dataset to infuse it with the assembled model strengths and human preferences. Finally, we distill Cockatiel-8B from Cockatiel-13B for ease of use. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method: we not only set new state-of-the-art performance on VDCSCORE in a dimension-balanced way but also surpass leading alternatives on human preference by a large margin, as shown by the human evaluation results.

Paper Structure

This paper contains 30 sections, 3 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Cockatiel, a three-stage training pipeline that ensembles synthetic and human-aligned training. Cockatiel-13B is capable of generating detailed captions that consistently align with every visual element in the input video (top left). Furthermore, Cockatiel-13B achieves new state-of-the-art, dimension-balanced performance on VDCSCORE while being consistently voted the most human-aligned model compared to the baselines (right). The key contributor to these capabilities is the ensembling of synthetic and human-preferenced training, which infuses Cockatiel-13B with the diverse strengths of leading VDC models and with human preferences (bottom left).
  • Figure 2: Overall pipeline of our proposed Cockatiel. Our training pipeline ensembles the advantages of both the base models and human preferences, yielding our Cockatiel captioner series. Through the ensembling of synthetic and human-preferenced training, Cockatiel-13B achieves significant VDC performance while being preferred by humans.
  • Figure 3: Qualitative comparison between Cockatiel-13B and the current state-of-the-art VDC models. For a detailed comparison between Cockatiel-13B and all leading VDC models, please refer to the supplementary files. Caption content that is exclusively captured by our model, captured by both our model and other baselines, or misaligned with the detailed visual elements in the videos is emphasized using red, yellow, and green backgrounds, respectively.
  • Figure 4: Ablation studies on the LoRA rank (left), training dataset size (middle), and the quality score threshold (right). For brevity, we report only the average accuracy on VDCSCORE; more comprehensive results are provided in the supplementary materials. The hyper-parameter settings are consistent across all the ablation studies, except for the ablated one. Specifically, the default settings are as follows: the LoRA rank is 256, the training dataset size is 20k, the quality score threshold is 3.5, and the selection policy is the scorer-based selection policy with a threshold on the quality score.
  • Figure 5: Human evaluation results. Our method, Cockatiel-13B, is clearly preferred by human evaluators over the baselines.
  • ...and 5 more figures
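
The scorer-based selection policy mentioned in the Figure 4 caption (a threshold of 3.5 on the quality score) can be illustrated with a minimal sketch. This is not the authors' implementation: the `ScoredCaption` record, the `select_captions` helper, the inclusive comparison, and the choice to keep only the single best passing caption per video are all assumptions made for illustration. The only value taken from the text is the default threshold.

```python
from dataclasses import dataclass


@dataclass
class ScoredCaption:
    # Hypothetical record: one synthetic caption from one base model,
    # together with the quality score assigned by the human-aligned scorer.
    video_id: str
    source_model: str
    caption: str
    quality_score: float


def select_captions(candidates, threshold=3.5):
    """Keep, per video, the highest-scoring caption with score >= threshold.

    Assumptions (not stated in the text): the comparison is inclusive, and
    at most one caption per video is kept. Videos with no caption reaching
    the threshold are dropped from the curated set.
    """
    best = {}
    for cand in candidates:
        if cand.quality_score < threshold:
            continue  # reject captions the scorer rates below the threshold
        current = best.get(cand.video_id)
        if current is None or cand.quality_score > current.quality_score:
            best[cand.video_id] = cand
    return list(best.values())


if __name__ == "__main__":
    # Toy scores, invented purely to exercise the function.
    pool = [
        ScoredCaption("vid_1", "model_a", "A person walks a dog.", 3.0),
        ScoredCaption("vid_1", "model_b", "A woman walks a small dog along a pier.", 4.2),
        ScoredCaption("vid_2", "model_a", "A car drives.", 3.4),
    ]
    for kept in select_captions(pool):
        print(kept.video_id, kept.source_model, kept.quality_score)
```

Under these assumptions, raising the threshold trades dataset size for caption quality, which is the trade-off the right-hand panel of Figure 4 is described as ablating.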