C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models

Nayoung Oh, Dohyun Kim, Junhyeong Bang, Rohan Paul, Daehyung Park

TL;DR

C2F-Space grounds complex spatial language in visual scenes with a two-stage coarse-to-fine framework: it first generates a spatially consistent region through grid-guided prompting and iterative validation, then applies superpixel-based refinement to align that region with the local context. The approach uses vision-language models (VLMs) guided by structured prompts, Grounded-SAM for object masking, and a validation loop that checks physical feasibility and semantic alignment. On a new space-grounding benchmark of 350 problems, C2F-Space outperforms five strong baselines in both success rate and IoU, and ablations confirm the two components’ synergistic effect. The work also demonstrates practical utility on simulated robotic pick-and-place tasks, indicating the method’s potential for real-world robotic instruction following and fine-grained spatial reasoning.

Abstract

Space grounding refers to localizing a set of spatial references described in natural language instructions. Traditional methods often fail to account for complex reasoning (such as distance, geometry, and inter-object relationships), while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce fine-grained output regions. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that (i) estimates an approximate yet spatially consistent region using a VLM, then (ii) refines the region to align with the local environment through superpixelization. For the coarse estimation, we design a grid-based visual-grounding prompt with a propose-validate strategy, maximizing the VLM's spatial understanding and yielding a physically and semantically valid canonical region (i.e., an ellipse). For the refinement, we locally adapt the region to the surrounding environment without over-relaxing it into free space. We construct a new space-grounding benchmark and compare C2F-Space with five state-of-the-art baselines using success rate and intersection-over-union. C2F-Space significantly outperforms all baselines. Our ablation study confirms the effectiveness of each module in the two-step process and the synergistic effect of the combined framework. Finally, we demonstrate the applicability of C2F-Space to simulated robotic pick-and-place tasks.
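
As a rough illustration of the intersection-over-union metric mentioned above, the following minimal sketch rasterizes an axis-aligned elliptical region onto a pixel grid and scores it against a target mask. The function names `ellipse_mask` and `iou`, the set-of-pixels mask representation, and the axis-aligned parameterization are assumptions made for this example; they are not taken from the paper, whose exact evaluation protocol is not given here.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (x, y) pixel coordinates


def ellipse_mask(cx: float, cy: float, a: float, b: float,
                 width: int, height: int) -> Set[Pixel]:
    """Return the pixels inside an axis-aligned ellipse (hypothetical helper).

    (cx, cy) is the center; a and b are the semi-axes along x and y.
    """
    return {
        (x, y)
        for y in range(height)
        for x in range(width)
        if ((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1.0
    }


def iou(pred: Set[Pixel], target: Set[Pixel]) -> float:
    """Intersection-over-union of two pixel sets; 0.0 if both are empty."""
    union = pred | target
    if not union:
        return 0.0
    return len(pred & target) / len(union)


if __name__ == "__main__":
    coarse = ellipse_mask(cx=5, cy=5, a=2, b=1, width=11, height=11)
    target = {(x, 5) for x in range(3, 8)}  # toy ground-truth strip
    print(f"IoU = {iou(coarse, target):.3f}")
```

Representing masks as sets keeps the sketch dependency-free; an array-based implementation would be the more usual choice for full-resolution images.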

Paper Structure

This paper contains 12 sections, 2 equations, 12 figures.

Figures (12)

  • Figure 1: Illustration of the two-stage space grounding result produced by the proposed C2F-Space. The grid-guided prompt enables the VLM to generate a coarse region proposal (e.g., an ellipsoid) through spatially multiplicative reasoning. A superpixel-based enhancement process then refines this proposal into a fine-grained spatial mask.
  • Figure 2: (a) Overall framework of C2F-Space. Given an instruction $\Lambda$ and an image $I$, the space-reasoning module proposes a candidate region $\tilde{\mathcal{M}}_{\Lambda}$, and the subsequent space-refinement module adapts this proposal to produce the precise region $\mathcal{M}_{\Lambda}$. (b) The space-reasoning module iteratively proposes and validates a canonical spatial region $\tilde{\mathcal{M}}_{\Lambda}$. At each iteration $k$, the VLM-based proposer predicts an elliptical region guided by the textual prompt $\nu^{\Lambda}_{k}$ and the visual prompt $\nu^{I}_{k}$, where the image $I$ is inpainted by a spatial grid. Two validators then ensure that the proposed region is both collision-free and semantically consistent with the instruction $\Lambda$. Note that the object identification module runs once beforehand to generate object masks used during the validation process.
  • Figure 3: A capture of the region proposal prompt for $k=0$
  • Figure 4: A capture of the physical validator prompt
  • Figure 5: A capture of the semantic validator prompt
  • ...and 7 more figures
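
The propose-validate loop described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `propose` and `is_semantically_valid` are hypothetical callables standing in for the VLM-based proposer and semantic validator, the collision check is a plain set intersection with precomputed object masks rather than a VLM prompt, and the text feedback passed back to the proposer as well as the `max_iters` cap are assumed for the example.

```python
from typing import Callable, List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Region = Set[Pixel]


def propose_validate(
    propose: Callable[[int, List[str]], Region],       # hypothetical proposer stand-in
    object_masks: List[Region],                        # computed once beforehand
    is_semantically_valid: Callable[[Region], bool],   # hypothetical validator stand-in
    max_iters: int = 5,                                # assumed iteration cap
) -> Optional[Region]:
    """Return the first proposal that passes both checks, or None."""
    feedback: List[str] = []
    for k in range(max_iters):
        region = propose(k, feedback)

        # Physical check (assumed form): the region must not overlap any object.
        if any(region & mask for mask in object_masks):
            feedback.append(f"iteration {k}: region collides with an object")
            continue

        # Semantic check: the region must agree with the instruction.
        if not is_semantically_valid(region):
            feedback.append(f"iteration {k}: region does not match the instruction")
            continue

        return region
    return None
```

A toy proposer that shifts a two-pixel region one step to the right per iteration is enough to exercise the loop; in a real system each callable would wrap a model query.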