AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu

TL;DR

AnyCap addresses the need for fine-grained, instruction-aligned captions across images, videos, and audio with a plug-and-play residual-correction framework (AnyCapModel, ACM) that refines base captions without retraining the underlying models. It pairs this with AnyCapDataset (ACD), a 300k-entry dataset of instruction and high-quality caption triplets across three modalities, and AnyCapEval, a benchmark that scores content and style separately and uses the Keypoint Density (KPD) metric for content. Empirical results show consistent improvements in content fidelity and stylistic alignment across diverse base models, both on AnyCapEval and on public benchmarks such as MIA-Bench and VidCapBench. The model, dataset, and benchmark are provided together as resources for reproducible research on controllable multimodal captioning.

Abstract

Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
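
The data flow described in the abstract (a frozen base model produces an initial caption, and a separate refiner consumes that caption together with the modality data and the user instruction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `CaptionRequest`, `refine_caption`, and the two toy callables are hypothetical names introduced here, and the toy refiner is a simple string rule standing in for a learned model.

```python
from dataclasses import dataclass
from typing import Callable

SUPPORTED_MODALITIES = {"image", "video", "audio"}


@dataclass
class CaptionRequest:
    """Hypothetical container for one captioning request."""
    modality: str      # one of SUPPORTED_MODALITIES
    media: bytes       # raw modality data (placeholder type for this sketch)
    instruction: str   # user instruction controlling content or style


# Hypothetical interfaces: any callables with these shapes can be plugged in.
BaseCaptioner = Callable[[CaptionRequest], str]
Refiner = Callable[[CaptionRequest, str], str]


def refine_caption(request: CaptionRequest,
                   base_captioner: BaseCaptioner,
                   refiner: Refiner) -> str:
    """Run the frozen base model, then correct its caption with the refiner."""
    if request.modality not in SUPPORTED_MODALITIES:
        raise ValueError(f"unsupported modality: {request.modality}")
    # Step 1: the base model is only called, never updated.
    initial_caption = base_captioner(request)
    # Step 2: the refiner sees the instruction, the media, and the initial caption.
    return refiner(request, initial_caption)


def toy_base_captioner(request: CaptionRequest) -> str:
    """Stand-in for a foundation model that ignores the length instruction."""
    return "A dog runs on a beach. The sky is orange. Waves are small."


def toy_refiner(request: CaptionRequest, initial_caption: str) -> str:
    """Stand-in for a learned refiner: enforces a one-sentence instruction."""
    if "one sentence" in request.instruction.lower():
        first = initial_caption.split(". ")[0].rstrip(".")
        return first + "."
    return initial_caption


request = CaptionRequest(modality="image", media=b"",
                         instruction="Describe the image in one sentence.")
final_caption = refine_caption(request, toy_base_captioner, toy_refiner)
print(final_caption)
```

Because the refiner is a separate component, swapping the base captioner only requires passing a different callable. That is the sense in which the design is plug-and-play in this sketch.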

Paper Structure

This paper contains 39 sections, 1 equation, 10 figures, 23 tables.

Figures (10)

  • Figure 2: AnyCap framework. (a) AnyCap is stacked upon various MLLMs, refining their initial captions into high-quality, instruction-aligned outputs. (b) Specifically, AnyCap takes as input the initial caption, the original modality data, and the user instruction to produce the final caption.
  • Figure 3: Details of AnyCapEval. (a) Examples for content evaluation via KPD and style evaluation via scoring rules. (b) AnyCapEval judgments highly align with human preference. (c) Integrating AnyCap consistently boosts base models across both content and style dimensions.
  • Figure 4: Impact of training data ratio.
  • Figure 5: Human evaluation comparing AnyCap-8B with GPT-4o. Captions refined with AnyCap align more closely with the given instructions, enhancing both content and style consistency.
  • Figure 6: AnyCap enables controllable captioning across modalities by refining base model outputs to better align with user instructions. Given a user instruction, it takes initial captions from a foundation model and corrects instruction violations (highlighted in red), producing compliant, instruction-following outputs (green), all without requiring fine-tuning of the base model.
  • ...and 5 more figures