VIVECaption: A Split Approach to Caption Quality Improvement

Varun Ananth, Baqiao Liu, Haoran Cai

TL;DR

This technical report introduces VIVECaption, a systematic two-sided approach to caption quality improvement that focuses on structured caption formats, which enable better parsing and downstream utilization. It shows that using a finetuned character detection model in an image captioning pipeline significantly improves holistic image-caption alignment quality.

Abstract

Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data, they suffer from hallucinations, poor compositional reasoning, and limited fine-grained understanding, resulting in misaligned image-caption pairs that degrade downstream model performance. This technical report introduces VIVECaption, a systematic two-sided approach to caption quality improvement. We first establish a comprehensive taxonomy of caption evaluation metrics, distinguishing between "universal" and "instance-grounded" metrics, with the ultimate goal of showcasing the use-cases and tradeoffs between different caption quality metrics. We then use this language to describe our two-sided approach to caption quality improvement: (1) a gold-standard dataset creation methodology using stratified sampling and (2) a model alignment strategy encompassing context alignment and parameter-level finetuning using SFT. We demonstrate our methodology on open-source models, focusing on structured caption formats that enable better parsing and downstream utilization. We ultimately show that using a finetuned character detection model in an image captioning pipeline significantly improves holistic image-caption alignment quality. Our work addresses the growing need for high-quality "vegan" training data in enterprise AI development, providing practical solutions for teams seeking to improve caption-image alignment without relying on potentially copyright-protected web-scraped content.

Paper Structure

This paper contains 19 sections, 5 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: As seen in the character grid (Figure 3), the character in the image above is Ellie, not Victoria. However, the caption claims Victoria to be the focus of the image. This datapoint, if used to train a downstream T2I model, would actively harm the model's performance.
  • Figure 2: Diagram illustrating the large categories of caption quality metrics: universal and instance-grounded. The two sub-categories of universal metrics are also displayed, with examples for all types.
  • Figure 3: Characters from "Sprite Fright". These images were used for in-context alignment as a reference for both the character detection model and the image captioning model. Providing images allows the VLM to "understand" the language particular to this short film. Although uniformity is preferable, the images do not need to be "uniform" in character posture or angle.
  • Figure 4: (Left) Two-dimensional UMAP projection of the CLIP embeddings of all sampled frames. Color distinguishes the 310 "clusters" found by HDBSCAN. Samples from each cluster were used to build a gold-standard dataset. (Right) Pie chart showing the occurrence of each character in the gold-standard dataset after stratified sampling on HDBSCAN clusters. Ellie's over-representation is a consequence of her being the main character and is essentially unavoidable. The gold-standard dataset contains 310 samples, which is 14.35% of the full 2161-frame dataset.
  • Figure 5: (Left) Average Macro F1 score on the test set as a function of Qwen2.5-VL parameter size, before and after SFT. The greatest jump in performance for both the baseline and SFT comes between the 3B and 7B parameter models. (Right) Average number of mistakes per example on the test set as a function of Qwen2.5-VL parameter size, before and after SFT. Again, the greatest jump in performance for both the baseline and SFT comes between the 3B and 7B parameter models.
  • ...and 8 more figures
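
The stratified sampling mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each frame already has a cluster label (for example from HDBSCAN, which marks noise points with -1) and draws a fixed number of frames from every cluster. The function name `stratified_sample`, the one-frame-per-cluster default, and the decision to skip noise points are illustrative assumptions, not details confirmed by the report.

```python
import random
from collections import defaultdict


def stratified_sample(cluster_labels, per_cluster=1, seed=0, skip_noise=True):
    """Return sorted frame indices with up to `per_cluster` picks per cluster.

    `cluster_labels[i]` is the cluster id of frame i. Skipping the label -1
    (HDBSCAN's noise label) is an assumption made for this sketch.
    """
    rng = random.Random(seed)

    # Group frame indices by their cluster label.
    by_cluster = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        if skip_noise and label == -1:
            continue
        by_cluster[label].append(idx)

    # Draw without replacement inside each cluster (the strata).
    picked = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        k = min(per_cluster, len(members))
        picked.extend(rng.sample(members, k))
    return sorted(picked)


# Toy labels for eight frames: three clusters plus one noise frame.
labels = [0, 0, 1, 2, 2, 2, -1, 1]
gold = stratified_sample(labels)

# The proportion quoted in the caption: 310 gold samples out of 2161 frames.
gold_fraction_percent = round(310 / 2161 * 100, 2)
```

Sampling per cluster rather than uniformly over all frames keeps rare scene types in the gold-standard set, though it cannot remove imbalances that are present inside every cluster, such as a main character who appears in most frames.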