AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
TL;DR
AnyCap addresses the need for fine-grained, instruction-aligned captions across images, videos, and audio with a plug-and-play residual-correction framework, AnyCapModel (ACM), which refines the captions produced by a base model without retraining that model. It is paired with AnyCapDataset (ACD), a 300k-triplet dataset of instructions and high-quality captions spanning three modalities, and AnyCapEval, a two-dimensional evaluation scheme (content and style) that includes the Keypoint Density metric. ACM consistently improves content fidelity and stylistic alignment across diverse base models on AnyCapEval and on public benchmarks such as MIA-Bench and VidCapBench; for example, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%. The project also provides resources for reproducibility and further research on controllable multimodal captioning.
Abstract
Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
