TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem

TL;DR

TextRegion tackles the gap between global image-text alignment and region-level understanding by integrating frozen image-text encoders with SAM2 segmentation masks to produce text-aligned region tokens. It introduces region-specific attention constraints and mask-guided pooling to transform region features into tokens compatible with text embeddings, enhanced by multi-resolution encoding and a global-patch suppression mechanism. The approach is training-free and compatible with multiple backbones (e.g., CLIP, SigLIP, Perception Encoder), achieving strong zero-shot performance in open-world semantic segmentation, and competitive results in zero-shot referring expression comprehension and multi-object grounding. By reframing dense prediction as region-level sparse classification, TextRegion offers a practical, scalable pathway to open-vocabulary region understanding using existing models.

Abstract

Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.

Paper Structure

This paper contains 18 sections, 7 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Using feature maps from frozen image-text models and segment masks from SAM2, we generate text-aligned region tokens that can be directly applied to various downstream tasks.
  • Figure 2: Patch Value and Mask-based Attention Pooling: (b) shows the segmentation results based on patch values, indicating that the patch values are aligned with vision-language semantics but can be noisy. (c) is the resized mask for a specific region, which restricts the aggregation to patches within that region. (d) demonstrates that by attending only to region-related patches, we can obtain a text-aligned region token, effectively mitigating the influence of imprecise patch values.
  • Figure 3: TextRegion Framework. Mask Generation: We generate $R$ soft masks using SAM2, with values ranging from 0 to 1, where each mask corresponds to a distinct region in the input image. Patch Encoding: The image is encoded to obtain a multi-resolution feature map, which is fed into the final attention block of the frozen image-text models. See Sec. \ref{sec:patch_encoding} for details. Mask-based Attention Pooling: As illustrated in Fig. \ref{fig:attention_values}, we perform pooling based on the $R$ bilinearly downsampled masks. Prediction: Using the pooled text-aligned region tokens, we support both zero-shot region-sparse classification and dense prediction.
  • Figure 3: Multiple Object Grounding. The last three rows show results using interpreted queries. TextRegion demonstrates significant performance gains when given LLaVA-interpreted queries, outperforming baselines by a large margin.
  • Figure 4: Global Patches. The first row shows segmentation examples for complex images. Despite the difficulty, the model produces correct results. In contrast, the second row presents an easier case where the model fails to segment properly. (b) and (c) show the segmentation results before and after removing global patches, respectively. (d) presents the region masks generated by SAM2, which are used to compute the local similarity defined in Eq. \ref{eq:local_similarity}. (e) visualizes the local similarity of patches, where lower similarity indicates a higher likelihood of being a global patch. In this case, the model incorrectly classifies the bed as a cat due to the presence of many global patches in the bed area.
  • ...and 6 more figures
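
The pipeline described in the Figure 3 caption (soft masks, pooling into region tokens, then region-sparse and dense prediction) can be illustrated with a small sketch. This is not the authors' implementation: it replaces the paper's mask-based attention pooling inside the final attention block with a plain mask-weighted average of patch features, and it assumes the masks have already been resized to the patch grid. The function names `pool_region_tokens`, `classify_regions`, and `dense_prediction`, as well as the toy features and text embeddings, are hypothetical and chosen only for illustration.

```python
import math


def pool_region_tokens(patch_feats, masks):
    """Mask-weighted average of patch features (a simplification, see lead-in).

    patch_feats: list of P feature vectors, one per patch, each of length D.
    masks: list of R soft masks, each a list of P weights in [0, 1],
           assumed to be already resized to the patch grid.
    Returns a list of R region tokens, each of length D.
    """
    dim = len(patch_feats[0])
    tokens = []
    for mask in masks:
        total = sum(mask)
        if total <= 0.0:
            raise ValueError("mask has no support on the patch grid")
        token = [0.0] * dim
        for weight, feat in zip(mask, patch_feats):
            for d in range(dim):
                token[d] += weight * feat[d]
        tokens.append([v / total for v in token])
    return tokens


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def classify_regions(tokens, text_embeds):
    """Region-sparse classification: best-matching text index per region."""
    labels = []
    for token in tokens:
        sims = [cosine(token, t) for t in text_embeds]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


def dense_prediction(masks, labels):
    """Dense prediction: each patch takes the label of its strongest region."""
    num_patches = len(masks[0])
    dense = []
    for p in range(num_patches):
        best_region = max(range(len(masks)), key=lambda r: masks[r][p])
        dense.append(labels[best_region])
    return dense


# Toy example (hypothetical data): 4 patches, 2-D features, 2 regions, 2 prompts.
patch_feats = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
masks = [[1.0, 0.8, 0.1, 0.0], [0.0, 0.2, 0.9, 1.0]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]

tokens = pool_region_tokens(patch_feats, masks)
labels = classify_regions(tokens, text_embeds)
dense = dense_prediction(masks, labels)
print(labels)  # [0, 1]
print(dense)   # [0, 0, 1, 1]
```

Only one similarity comparison is made per region rather than per pixel, which is the sense in which dense prediction becomes region-level sparse classification; the dense map is then recovered by painting each region's label back through its mask.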