COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen

TL;DR

The paper tackles the scarcity of richly annotated, region-grounded image-text data by introducing the COCONut-PanCap dataset, which pairs dense panoptic segmentation masks with long, region-grounded captions. It leverages a human-in-the-loop annotation pipeline that uses VLM drafts (GPT-4V) refined by human raters to produce comprehensive, mask-grounded captions for COCO-derived images. The authors also propose PanCaper, a baseline for joint panoptic segmentation and grounded captioning, and demonstrate that training with COCONut-PanCap substantially improves performance across detailed captioning, grounded segmentation, and text-conditioned image generation, as well as downstream VQA and referring segmentation tasks. This work provides a new benchmark for joint panoptic grounding and captioning, supporting high-quality instruction tuning and offering a scalable path toward better multi-modal understanding and generation.

Abstract

This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
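To make "captions grounded in panoptic segmentation masks" concrete, the following is a minimal sketch of how one such record could be represented. The schema is an assumption made for illustration only: the class names (`Segment`, `GroundedPhrase`, `GroundedCaption`), the fields, and the example caption are hypothetical and are not taken from the dataset's actual release format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Segment:
    """One panoptic segment (hypothetical schema, not the dataset's real format)."""
    segment_id: int
    category: str
    is_thing: bool  # True for countable objects, False for "stuff" such as walls


@dataclass
class GroundedPhrase:
    """A caption span tied to one or more segment ids."""
    start: int  # character offset into the caption, inclusive
    end: int    # character offset into the caption, exclusive
    segment_ids: list[int]


@dataclass
class GroundedCaption:
    caption: str
    segments: list[Segment]
    phrases: list[GroundedPhrase] = field(default_factory=list)

    def ground(self, text: str, segment_ids: list[int]) -> None:
        """Link the first occurrence of `text` in the caption to the given segments."""
        known = {s.segment_id for s in self.segments}
        missing = [i for i in segment_ids if i not in known]
        if missing:
            raise ValueError(f"unknown segment ids: {missing}")
        start = self.caption.find(text)
        if start < 0:
            raise ValueError(f"phrase not found in caption: {text!r}")
        self.phrases.append(GroundedPhrase(start, start + len(text), list(segment_ids)))

    def ungrounded_segments(self) -> list[int]:
        """Return ids of segments that no caption phrase refers to yet."""
        used = {i for p in self.phrases for i in p.segment_ids}
        return [s.segment_id for s in self.segments if s.segment_id not in used]


# Illustrative record; the caption and categories are made up for this sketch.
record = GroundedCaption(
    caption="A man sits at a table in a dining room.",
    segments=[
        Segment(1, "person", True),
        Segment(2, "dining table", True),
        Segment(3, "wall", False),
    ],
)
record.ground("A man", [1])
record.ground("a table", [2])
```

A check such as `ungrounded_segments()` reflects the idea that a scene-comprehensive caption should mention every mask; here it would flag the wall segment as not yet described.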

Paper Structure

This paper contains 16 sections, 1 equation, 13 figures, 8 tables.

Figures (13)

  • Figure 1: COCONut-PanCap Dataset. Top: The proposed COCONut-PanCap dataset features detailed captions grounded with dense panoptic segmentation masks. Bottom: COCONut-PanCap supports various fine-grained understanding and generation tasks, including detailed captioning, panoptic segmentation with grounded captioning, and text-to-image generation. The dataset also facilitates several downstream tasks, such as visual question answering (VQA) and referring segmentation.
  • Figure 2: Annotation Pipeline. Given an input image, human-annotated panoptic segmentation masks are overlaid using the set-of-marks visualization technique (Yang et al., 2023) to prompt the vision-language model (VLM). After the VLM generates an initial draft, human effort is invested in editing and verification. Finally, the annotated metadata is formatted to construct datasets for various tasks at the instruction-tuning or finetuning stage.
  • Figure 3: Designed Prompt Template. Given the concatenated set-of-marks images, the right side (round-1) shows the initial response and the corresponding human edits. Once finalized by humans, these edits are merged into a single detailed caption grounded with panoptic segmentation masks, as shown on the left side (round-2).
  • Figure 4: Frequency of Extracted Nouns from the COCONut-PanCap Dataset. The top 10 most frequent nouns are: people, table, room, street, dining, man, person, cars, chairs, and field.
  • Figure 5: Caption Quality via User Study. The study involved human evaluators assessing a random sample of 1,000 captions, with a strong preference shown for captions from our dataset.
  • ...and 8 more figures