URECA: Unique Region Caption Anything

Sangbeom Lim, Junwan Kim, Heeji Yoon, Jaewoo Jung, Seungryong Kim

TL;DR

The paper tackles the challenge of region-level captioning across multiple granularity levels by introducing the URECA dataset, which enforces a unique region-caption mapping over diverse objects, parts, and backgrounds. It then presents URECA, a model that encodes masked regions with a dedicated mask encoder and dynamic masking to preserve spatial identity, integrating with Multimodal Large Language Models for distinctive, contextually grounded captions. The authors demonstrate state-of-the-art performance on the URECA test set and strong zero-shot generalization to Visual Genome and RefCOCOg, validating both the dataset and architectural design. This work advances practical region understanding by enabling precise, hierarchical, and unique region descriptions that generalize beyond the curated dataset.

Abstract

Region-level captioning aims to generate natural language descriptions for specific image regions while highlighting their distinguishing features. However, existing methods struggle to produce unique captions across multiple granularity levels, limiting their real-world applicability. To address the need for detailed region-level understanding, we introduce the URECA dataset, a large-scale dataset tailored for multi-granularity region captioning. Unlike prior datasets that focus primarily on salient objects, the URECA dataset ensures a unique and consistent mapping between regions and captions by incorporating a diverse set of objects, parts, and background elements. Central to this is a stage-wise data curation pipeline, where each stage incrementally refines region selection and caption generation. By leveraging Multimodal Large Language Models (MLLMs) at each stage, our pipeline produces distinctive and contextually grounded captions with improved accuracy and semantic diversity. Building upon this dataset, we present URECA, a novel captioning model designed to effectively encode multi-granularity regions. URECA maintains essential spatial properties such as position and shape through simple yet impactful modifications to existing MLLMs, enabling fine-grained and semantically rich region descriptions. Our approach introduces dynamic mask modeling and a high-resolution mask encoder to enhance caption uniqueness. Experiments show that URECA achieves state-of-the-art performance on the URECA dataset and generalizes well to existing region-level captioning benchmarks.

Paper Structure

This paper contains 31 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Unique Region Caption Anything. We introduce the URECA dataset, a novel region-level captioning dataset designed to ensure caption uniqueness and support multi-granularity regions. Each caption in our benchmark is uniquely mapped to its corresponding region, capturing distinctive attributes that differentiate it from surrounding areas. Moreover, we show that our proposed model trained on our dataset effectively generates unique captions for regions at any level of granularity.
  • Figure 2: Automated data curation pipeline of the URECA dataset. Our pipeline consists of four key stages to generate unique captions for multi-granularity regions. In Stage 1, we construct a mask tree that captures hierarchical relationships between regions. Stage 2 generates short captions based on the parent node. Stage 3 aggregates captions from child nodes, and Stage 4 ensures that each node is assigned a unique caption. Best viewed zoomed in.
  • Figure 3: URECA architecture. URECA enables users to generate unique captions that describe distinctive attributes of any region. The mask encoder effectively encodes multi-granularity regions while preserving their identity. The mask token serves as a localizer, guiding the LLM to generate region-specific captions based on the image and query token.
  • Figure 4: Qualitative results of URECA and the comparison models Kosmos-2 and OMG-LLaVA. Our model generates unique captions conditioned on multi-granularity regions.
  • Figure A: Qualitative results of URECA and the comparison models ViP-LLaVA and OMG-LLaVA. Our model generates unique captions conditioned on multi-granularity regions.
  • ...and 1 more figure
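
The Figure 2 caption says Stage 1 builds a mask tree that captures hierarchical relationships between regions, but this listing does not say how parent-child links are decided. The following is only a minimal sketch under an assumed rule: each mask is attached to the smallest larger mask that covers at least a chosen fraction of its pixels. The names `MaskNode`, `rect`, `build_mask_tree`, `iter_nodes`, and the `min_overlap` threshold are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class MaskNode:
    """One region in the hierarchy; `pixels` is a set of (row, col) pairs."""
    name: str
    pixels: frozenset
    parent: "MaskNode | None" = field(default=None, repr=False)
    children: list = field(default_factory=list)


def rect(top, left, height, width):
    """Hypothetical helper: a filled rectangle as a set of pixel coordinates."""
    return frozenset(
        (r, c)
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


def build_mask_tree(masks, image_pixels, min_overlap=0.9):
    """Assumed rule (not from the paper): the parent of a mask is the smallest
    strictly larger mask covering at least `min_overlap` of its area.
    Masks with no such cover are attached to the whole-image root."""
    root = MaskNode("image", image_pixels)
    # Sort large to small so every possible parent precedes its children.
    # Empty masks are skipped to avoid dividing by zero below.
    nodes = [
        MaskNode(name, pixels)
        for name, pixels in sorted(
            masks.items(), key=lambda kv: len(kv[1]), reverse=True
        )
        if pixels
    ]
    for i, node in enumerate(nodes):
        best = root
        for cand in nodes[:i]:
            if len(cand.pixels) <= len(node.pixels):
                continue  # an equal-sized mask is not treated as a parent
            covered = len(node.pixels & cand.pixels) / len(node.pixels)
            if covered >= min_overlap and len(cand.pixels) < len(best.pixels):
                best = cand
        node.parent = best
        best.children.append(node)
    return root


def iter_nodes(node):
    """Yield nodes top-down (each parent before its children)."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


# Toy 10x10 image with nested and disjoint regions (made-up data).
image = rect(0, 0, 10, 10)
masks = {
    "person": rect(1, 1, 8, 4),
    "shirt": rect(3, 1, 3, 4),
    "logo": rect(4, 2, 1, 1),
    "sky": rect(0, 6, 4, 4),
}
root = build_mask_tree(masks, image)
parents = {n.name: n.parent.name for n in iter_nodes(root) if n.parent}
```

With these toy masks, `parents` maps `person` and `sky` to `image`, `shirt` to `person`, and `logo` to `shirt`. A top-down walk such as `iter_nodes` visits each parent before its children, which is the order a parent-conditioned step like Stage 2 would need; a step that aggregates child captions, like Stage 3, would walk the same tree bottom-up.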