COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
TL;DR
The paper addresses the scarcity of richly annotated, region-grounded image-text data by introducing COCONut-PanCap, a dataset that pairs dense panoptic segmentation masks with long, region-grounded captions. Its human-in-the-loop annotation pipeline has human raters refine drafts from a VLM (GPT-4V), yielding comprehensive, mask-grounded captions for COCO-derived images. The authors also propose PanCaper, a baseline for joint panoptic segmentation and grounded captioning. They report that training with COCONut-PanCap substantially improves detailed captioning, grounded segmentation, and text-conditioned image generation, as well as downstream VQA and referring segmentation. The work thus provides a new benchmark for joint panoptic grounding and captioning, supports high-quality instruction tuning, and offers a scalable path toward better multi-modal understanding and generation.
Abstract
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
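To make "region-level captions grounded in panoptic segmentation masks" concrete, the sketch below shows one way such an annotation could be represented and checked for consistency. It is an illustration only: the class names (`RegionCaption`, `PanCapRecord`), the field names, and the toy mask are assumptions made for this example and do not reflect the dataset's actual file format or the authors' tooling.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class RegionCaption:
    """Hypothetical region-level caption tied to one panoptic segment."""
    segment_id: int   # id of the segment in the panoptic mask
    category: str     # class label of the segment
    text: str         # caption describing this region


@dataclass
class PanCapRecord:
    """Hypothetical record: a panoptic mask plus captions grounded in it."""
    panoptic_mask: List[List[int]]  # H x W grid of segment ids
    regions: List[RegionCaption] = field(default_factory=list)

    def mask_ids(self) -> Set[int]:
        # Collect every segment id that appears in the mask.
        return {sid for row in self.panoptic_mask for sid in row}

    def ungrounded(self) -> List[int]:
        # Captioned segment ids that do not exist in the mask.
        ids = self.mask_ids()
        return [r.segment_id for r in self.regions if r.segment_id not in ids]

    def uncaptioned(self) -> List[int]:
        # Mask segments that have no caption attached.
        captioned = {r.segment_id for r in self.regions}
        return sorted(self.mask_ids() - captioned)


# Toy 2 x 3 mask with three segments (ids 1, 2, 3).
record = PanCapRecord(
    panoptic_mask=[[1, 1, 2], [1, 3, 2]],
    regions=[
        RegionCaption(1, "sky", "a clear blue sky"),
        RegionCaption(2, "dog", "a brown dog sitting on the right"),
        RegionCaption(7, "ball", "a red ball"),  # id 7 is not in the mask
    ],
)

print(record.ungrounded())   # captions pointing at missing segments
print(record.uncaptioned())  # segments that still need a caption
```

A check of this kind captures the consistency idea in the abstract: every caption should point at a segment that exists, and a scene-comprehensive description should leave no segment undescribed.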
