UGround: Towards Unified Visual Grounding with Unrolled Transformers

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou

TL;DR

UGround tackles the fragmentation of visual grounding tasks by unifying them under a single framework that dynamically selects intermediate transformer layers and uses a similarity-based mask as a prompt to SAM. The approach, built on Stochastic Skip Connections and Mask as Prompt, produces explicit spatial cues and enables end-to-end supervision, including on zero-target (empty) cases. Empirical results across ReasonSeg, RefCOCO(+/g), and gRefCOCO demonstrate state-of-the-art gains and improved robustness, with code released for community use. This work advances unified visual grounding by integrating attribute variation, dynamic connectivity, and explicit spatial prompting into a single, learnable system.

Abstract

We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt". UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, dynamically selecting the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we unify, for the first time, visual grounding within a single framework from an attribute perspective, spanning traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
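The MasP step described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `similarity_mask`, the use of cosine similarity, and the toy shapes are all assumptions made for illustration, since the abstract only says the map is derived from the <SEG> token and the image tokens. Resizing the map and passing it to SAM as a mask prompt is omitted.

```python
import numpy as np

def similarity_mask(seg_embedding, image_tokens, grid_hw):
    """Hypothetical helper: score each image token against one <SEG> embedding.

    Assumes cosine similarity; the paper's exact similarity may differ.
    Returns an (h, w) map whose high values mark candidate target regions.
    """
    h, w = grid_hw
    # Normalize both sides so the dot product is a cosine similarity.
    seg = seg_embedding / (np.linalg.norm(seg_embedding) + 1e-8)
    img = image_tokens / (np.linalg.norm(image_tokens, axis=1, keepdims=True) + 1e-8)
    sim = img @ seg              # shape (h * w,)
    return sim.reshape(h, w)     # restore the spatial layout of the patch grid

# Toy data: a 4x4 grid of image tokens with 16-dimensional embeddings.
rng = np.random.default_rng(0)
dim, h, w = 16, 4, 4
image_tokens = rng.normal(size=(h * w, dim))

# Make the <SEG> embedding nearly identical to token 5 (row 1, column 1),
# so that location should dominate the similarity map.
seg_embedding = image_tokens[5] + 0.01 * rng.normal(size=dim)

mask_logits = similarity_mask(seg_embedding, image_tokens, (h, w))
peak = np.unravel_index(np.argmax(mask_logits), mask_logits.shape)
print("peak location:", peak)
```

The point of the sketch is that the map carries spatial information directly: its activation peak sits at the patch the <SEG> embedding resembles, which is the kind of explicit cue a bare embedding prompt does not expose.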

Paper Structure

This paper contains 15 sections, 9 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: (a) Prior works typically use <SEG> token embeddings from the last hidden layer as the prompt. (b) In contrast, we use as the prompt the similarity map (defined by the paper's similarity-map equation), generated from <SEG> and image token embeddings across dynamically selected transformer layers. (c) Dynamic layer selection allows the similarity map to slide across transformer layers. Due to the sequential nature of transformers, $\ell_L-\ell_k$ layers are skipped, enabling a direct connection with SAM in a single forward step $\mathcal{T}_t$ ("skip-connection-like"). Across multiple forward steps $\mathcal{T}_1$ to $\mathcal{T}_T$, the connectivity to SAM varies dynamically, with only one path activated at each step ("dropout-like").
  • Figure 2: The first panel shows that the dynamic layer selection strategy outperforms the fixed last hidden layer strategy across all middle layers. The second panel plots the loss of similarity maps against the soft ground-truth mask, demonstrating faster convergence at intermediate layers.
  • Figure 3: Overview of our proposed UGround. Central to UGround is Policy-Prompted Masking (PPM), which stochastically selects layer $\ell^*$ among "Unrolled" transformers from a policy distribution $\pi_\theta(\cdot \mid \mathcal{H}_{t^*})$ at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the layer $\ell^*$, MasP uses the similarity map as a soft logit mask to prompt SAM for mask generation, wherein we advance visual grounding within a "Unified" paradigm from an attribute perspective.
  • Figure 4: The UGround code repository.
  • Figure 5: The UGround dashboard.
  • ...and 4 more figures
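
The layer selection described in the captions for Figures 1 and 3 can be sketched as sampling one layer index per forward step from a categorical policy. This is a minimal illustration under stated assumptions, not the paper's method: `sample_layer`, the uniform initial logits, and the toy tensor shapes are hypothetical, and the reward and policy update used to train the policy are not shown because this section does not specify them.

```python
import numpy as np

def sample_layer(policy_logits, rng):
    """Hypothetical helper: draw one layer index from a softmax policy.

    Returns the sampled layer, the full probability vector, and the
    log-probability of the sample (the quantity a policy-gradient
    update would typically use).
    """
    z = policy_logits - policy_logits.max()      # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum()
    layer = int(rng.choice(len(probs), p=probs))
    return layer, probs, float(np.log(probs[layer]))

rng = np.random.default_rng(0)
num_layers, seq_len, dim = 8, 5, 16

# Toy stand-in for the per-layer hidden states of an unrolled transformer.
hidden_states = rng.normal(size=(num_layers, seq_len, dim))
policy_logits = np.zeros(num_layers)   # assumed uniform policy at the start
seg_position = seq_len - 1             # assume <SEG> is the last token

counts = np.zeros(num_layers, dtype=int)
for step in range(200):
    # One forward step: exactly one layer is connected to the vision model.
    layer, probs, log_prob = sample_layer(policy_logits, rng)
    seg_embedding = hidden_states[layer, seg_position]
    counts[layer] += 1

print("times each layer was selected:", counts)
```

Within a single step only the sampled layer feeds the vision model, which matches the "skip-connection-like" description, while the selected path changing from step to step matches the "dropout-like" one.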