Reasoning to Attend: Try to Understand How <SEG> Token Works

Rui Qian, Xin Yin, Dejing Dou

TL;DR

This work investigates how the <SEG> token grounds textual prompts to visual space in large multimodal models by visualizing semantic similarity maps between the token and image patches. It introduces READ, a modular framework with a Similarity as Points (SasP) module that converts highly activated similarity points into differentiable prompts for the SAM decoder, enabling the model to reason about where to attend and how to attend. READ leverages a frozen LLaVA encoder and a SAM mask decoder, trained with a joint objective combining text and mask losses, and uses a Discrete-to-Continuous (DtoC) interpolation to backpropagate through the attention cues. Across ReasonSeg and RefCOCO(+/g), READ achieves state-of-the-art gains, including substantial improvements under false-premise scenarios (FP-RefCOCO(+/g)), demonstrating robust and scalable improvements in reasoning segmentation. The approach is plug-and-play with existing <SEG>-like pipelines, offering a practical path to enhance multimodal grounding and interpretability in vision-language systems.

Abstract

Current visual grounding approaches built on Large Multimodal Models (LMMs) typically rely on $\texttt{<SEG>}$ tokens as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how this token works. In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the $\texttt{<SEG>}$ token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and the SAM decoder. Intriguingly, we find a striking consistency in the activation responses of the similarity maps, which reveals that what the $\texttt{<SEG>}$ token contributes is semantic similarity within image-text pairs. Specifically, the $\texttt{<SEG>}$ token, a placeholder added to the text vocabulary, extensively queries among individual tokenized image patches to match the semantics of an object from the text to the paired image while the Large Language Models (LLMs) are being fine-tuned. Building on these findings, we present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points borrowed from similarity maps. Remarkably, READ features an intuitive design, the Similarity as Points module (SasP), which can be seamlessly applied to $\texttt{<SEG>}$-like paradigms in a plug-and-play fashion. Extensive experiments have been conducted on the ReasonSeg and RefCOCO(+/g) datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, we further assess its generation ability on an augmented FP-RefCOCO(+/g) dataset. All code and models are publicly available at https://github.com/rui-qian/READ.
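The similarity-map idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes cosine similarity as the similarity measure and min-max normalization of the map, and the function names (`cosine`, `similarity_map`, `top_k_points`) and toy embeddings are hypothetical. It uses plain Python lists for readability, whereas a real pipeline would operate on hidden-state tensors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity_map(seg_embedding, patch_embeddings, grid_w):
    """Score every image patch against the <SEG> embedding.

    Returns a 2D grid (list of rows) min-max normalized to [0, 1].
    patch_embeddings is a row-major list of patch vectors.
    """
    sims = [cosine(seg_embedding, p) for p in patch_embeddings]
    lo, hi = min(sims), max(sims)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat map
    norm = [(s - lo) / span for s in sims]
    return [norm[i:i + grid_w] for i in range(0, len(norm), grid_w)]

def top_k_points(sim_map, k):
    """Return the k most activated cells as (x, y, score), highest first."""
    cells = [(x, y, s) for y, row in enumerate(sim_map) for x, s in enumerate(row)]
    return sorted(cells, key=lambda c: c[2], reverse=True)[:k]

# Toy example: a 2x2 patch grid with 3-dimensional embeddings.
seg = [1.0, 0.0, 0.0]
patches = [
    [0.9, 0.1, 0.0],   # (x=0, y=0): nearly aligned with <SEG>
    [0.0, 1.0, 0.0],   # (x=1, y=0): orthogonal
    [0.5, 0.5, 0.0],   # (x=0, y=1): partially aligned
    [-1.0, 0.0, 0.0],  # (x=1, y=1): opposite direction
]
sim = similarity_map(seg, patches, grid_w=2)
points = top_k_points(sim, k=2)
```

In this toy setup the two selected points are the patches most aligned with the `<SEG>` embedding; in a SAM-style pipeline such coordinates would play the role of point prompts.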

Paper Structure

This paper contains 41 sections, 15 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Qualitative analysis of the <SEG> token on the ReasonSeg train set. Points derived from $(c)$ serve as prompts with original SAM in $(a)$. Text "antler" with image token from CLIP is in $(b)$. The similarity between the <SEG> token and image token embeddings stemming from the last hidden layer is obtained by Eq. \ref{eq:similarity}, w.r.t. LLaVA encoder in $(c)$ and SAM decoder in $(d)$. The consistency observed in $(b)$, $(c)$, $(d)$ indicates that the <SEG> token in LMMs learns semantics similar to direct mentions in text. Refer to Appendix \ref{sup:additional_analysis} for more illustrations.
  • Figure 2: Overview of our proposed READ. The hidden state outputs with respect to the <SEG> token and image tokens are derived from the LLaVA encoder for similarity as points, before being fed into the prompt encoder for sparse embedding. To inform the model where to "attend" when reasoning, we apply a Gaussian-like weighted average interpolation to transform discrete points into continuous ones.
  • Figure 3: Visual comparison among READ (ours) and prior works on the ReasonSeg val set. Refer to Appendix \ref{sup:additional_qualitative_results} for more illustrations.
  • Figure 4: Qualitative analysis of the <SEG> token on the ReasonSeg val set. The $1^{st}$, $2^{nd}$, and $3^{rd}$ columns of $(a)$, $(b)$, and $(c)$ are LISA, SESAME, and READ (Ours) for comparisons, respectively. Points derived from $(a)$ serve as prompts with original SAM in $(c)$.
  • Figure 5: Showcase of complex reasoning and world knowledge.
  • ...and 3 more figures
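The Figure 2 caption mentions a Gaussian-like weighted average interpolation that turns discrete points into continuous ones. The sketch below shows one plausible reading of that idea, not the paper's exact formulation: a discrete peak is replaced by the average of neighboring grid coordinates, weighted by similarity times a Gaussian of the distance to the peak. The function name `soft_point` and the parameters `sigma` and `radius` are assumptions introduced for illustration.

```python
import math

def soft_point(sim_map, px, py, sigma=1.0, radius=1):
    """Continuous (x, y) around the discrete peak (px, py).

    Each neighbor within `radius` contributes its coordinates, weighted by
    its similarity score times a Gaussian of its squared distance to the
    peak. The result is a smooth function of the similarity values, which
    is what would let gradients flow in an autograd framework.
    """
    h, w = len(sim_map), len(sim_map[0])
    total = sum_x = sum_y = 0.0
    for y in range(max(0, py - radius), min(h, py + radius + 1)):
        for x in range(max(0, px - radius), min(w, px + radius + 1)):
            d2 = (x - px) ** 2 + (y - py) ** 2
            weight = sim_map[y][x] * math.exp(-d2 / (2 * sigma ** 2))
            total += weight
            sum_x += weight * x
            sum_y += weight * y
    if total == 0:
        # No activation in the window: fall back to the discrete point.
        return float(px), float(py)
    return sum_x / total, sum_y / total

# Symmetric activation keeps the point at the peak.
symmetric = [[0.0, 0.5, 0.0],
             [0.5, 1.0, 0.5],
             [0.0, 0.5, 0.0]]
cx, cy = soft_point(symmetric, 1, 1)

# Extra activation to the right pulls the point rightward.
skewed = [[0.0, 0.0, 0.0],
          [0.0, 1.0, 1.0],
          [0.0, 0.0, 0.0]]
sx, sy = soft_point(skewed, 1, 1)
```

The design choice here is that the output coordinate depends continuously on the similarity scores, unlike a hard argmax, whose output does not change under small perturbations of the scores.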