InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai

TL;DR

InstanceCap introduces an instance-aware structured caption framework for text-to-video generation that decomposes videos into local instances using an auxiliary model cluster (AMC) and refines dense prompts into concise, structured phrases via an improved CoT pipeline with multimodal LLMs. A new 22K InstanceVid dataset is created to train this framework, and an InstanceEnhancer module tailors inference prompts to align with the structured caption format. Empirical results show improved fidelity and reduced hallucinations in caption–video pairs, both in video reconstruction and T2V generation, with strong gains in instance detail and motion accuracy. The approach demonstrates that instance-level guidance and carefully designed prompts can substantially enhance video synthesis quality, offering practical benefits for open-domain video generation and downstream applications.

Abstract

Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient detail, hallucinations, and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video captioning for the first time. Based on this scheme, we design an auxiliary model cluster to convert the original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.

Paper Structure

This paper contains 45 sections, 2 equations, 23 figures, 4 tables.

Figures (23)

  • Figure 1: Top: Comparison of the reconstruction-via-recaption results between $\mathtt{InstanceCap}$ and state-of-the-art captioning methods for annotating the ground truth video. $\mathtt{InstanceCap}$ produces results that more closely resemble the original video, showing greater detail fidelity (highlighted by the red circle). Bottom: The corresponding captions generated by $\mathtt{InstanceCap}$ and others. Red denotes incorrect captions, blue represents ambiguous captions, and green indicates detailed and accurate descriptions of video. Specific visual hints are marked as A, B, and C for clarity. All videos are generated using the same video generation product, Hailuo AI, which has robust prompt-following capabilities, clearly highlighting the effectiveness of $\mathtt{InstanceCap}$.
  • Figure 1: Quantitative comparisons on reconstruction-via-recaption results. The best results are marked in bold, and the second-best are underscored. As a reference, CogVideoX-5b accepts $226$ text tokens, with any excess being truncated.
  • Figure 2: Overview of InstanceCap pipeline. Details of "from dense prompts to structured phrases" design are shown in Figure 3.
  • Figure 3: Details on "from dense prompts to structured phrases" design. We propose an improved CoT pipeline with carefully designed information interactions (red arrow), which facilitates MLLMs to accurately capture instances with precise descriptions on attributes.
  • Figure 4: $\mathtt{InstanceVid}$ provides structured captions for videos in open-domain scenarios, featuring diverse instances, expansive scenes, precise and instance-aware captions, and video-generation-friendly durations.
  • ...and 18 more figures
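
The 226-token limit noted in the reconstruction comparison means that any caption text past the limit never reaches the generator, so caption length matters as much as caption detail. The following is a minimal sketch of that truncation, assuming whitespace-separated words stand in for the model's real tokenizer (which will count tokens differently); `truncate_caption` and `MAX_TEXT_TOKENS` are hypothetical names for illustration and are not taken from the paper.

```python
from typing import Tuple

# Limit quoted in the caption for CogVideoX-5b; the name is illustrative.
MAX_TEXT_TOKENS = 226


def truncate_caption(caption: str, max_tokens: int = MAX_TEXT_TOKENS) -> Tuple[str, int]:
    """Cut a caption to max_tokens and report how many tokens were dropped.

    Assumption: whitespace splitting approximates tokenization. A real
    text encoder uses a subword tokenizer and will give different counts.
    """
    tokens = caption.split()
    kept = tokens[:max_tokens]
    dropped = len(tokens) - len(kept)
    return " ".join(kept), dropped


# A synthetic 300-word "dense" caption: everything past word 226 is lost.
dense = " ".join(f"w{i}" for i in range(300))
short, lost = truncate_caption(dense)
print(len(short.split()), lost)  # prints: 226 74
```

Under this approximation, a dense caption of 300 words loses its last 74 words, which is one reason concise structured phrases can be preferable to long free-form descriptions.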