LINGO-Space: Language-Conditioned Incremental Grounding for Space

Dohyun Kim, Nayoung Oh, Deokmin Hwang, Daehyung Park

TL;DR

LINGO-Space addresses the problem of space grounding, the localization of spatial references in composite natural-language instructions, by learning a probabilistic, language-conditioned space representation. It combines a scene-graph–driven grounding framework with an LLM-guided semantic parser to decompose instructions into relation tuples and incrementally update a mixture of configurable polar distributions that model the target region in $\mathbb{R}^2$. The key innovations are the use of instance-wise polar distributions with parameters $\bm{\theta}$, a GPS-layer–based estimation network, and an LLM-based parser that robustly handles diverse linguistic structures. Empirical results on 12 single-expression and composite tasks, plus real-world Spot navigation, show superior grounding accuracy, better generalization to unseen objects and predicates, and scalable handling of multiple referring expressions, indicating strong practical impact for language-guided robotic manipulation and navigation.

Abstract

We aim to solve the problem of spatially localizing composite instructions referring to space: space grounding. Compared to current instance grounding, space grounding is challenging due to the ill-posedness of identifying locations referred to by discrete expressions and the compositional ambiguity of referring expressions. Therefore, we propose a novel probabilistic space-grounding methodology (LINGO-Space) that uses configurable polar distributions to identify a probability distribution over the space being referred to and incrementally updates it given subsequent referring expressions. Our evaluations show that estimation with polar distributions enables a robot to ground locations successfully across $20$ table-top manipulation benchmark tests. We also show that updating the distribution helps the grounding method accurately narrow the referred space. Finally, we demonstrate the robustness of the space grounding with simulated manipulation and real quadruped robot navigation tasks. Code and videos are available at https://lingo-space.github.io.

Paper Structure (18 sections, 9 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: An illustration of incremental space grounding in the navigation task. Our method, LINGO-Space, identifies the distribution of the target location indicated by a natural language instruction with referring expressions.
  • Figure 2: The overall architecture of LINGO-Space on a tabletop manipulation task. Given a composite instruction, a graph generator provides a scene graph. A semantic parser decomposes the instruction into a structured form of relation-embedding tuples $r^{(i)}$, where $i\in\{1, \ldots, M\}$. Finally, a spatial-distribution estimator incrementally updates a probabilistic distribution of locations satisfying spatial constraints encoded in the embedding tuples.
  • Figure 3: Architecture of the spatial-distribution estimator. Given the graph representation of the problem description, the network predicts instance-wise polar distributions, updating the internal model with the previous state.
  • Figure 4: Qualitative evaluation with LINGO-Space, CLIPort, and ParaGon. Grey boxes mark the object that the $i$-th phrase refers to, while red dots and blue dots mark the ground truth and the prediction, respectively. We plot $100$ particles for ParaGon's prediction. The results show that LINGO-Space accurately identifies the space referred to by a composite instruction by progressively narrowing it down.
  • Figure 5: Grounding performance as the number of expressions increases. Each plot uses a different metric: (left) the binary success score used in the ParaGon benchmark and (right) the success score.
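
The incremental narrowing described in the TL;DR and in Figures 2 and 3 can be illustrated with a small sketch. This summary does not give the parameterization of the polar distributions, so the code below assumes one: a Gaussian over the distance to a reference object multiplied by a von Mises distribution over the direction. The update rule is also a stand-in: the paper uses a learned estimation network, whereas this sketch simply multiplies the previous belief by the new density on a grid and renormalizes. All names (`polar_density`, `update_belief`) and all parameter values are hypothetical.

```python
import numpy as np

def polar_density(points, center, mu_r, sigma_r, mu_phi, kappa):
    """Assumed polar distribution around one object: Gaussian over the
    distance r to `center` times von Mises over the direction phi."""
    d = points - np.asarray(center)
    r = np.hypot(d[:, 0], d[:, 1])
    phi = np.arctan2(d[:, 1], d[:, 0])
    radial = np.exp(-0.5 * ((r - mu_r) / sigma_r) ** 2) / (sigma_r * np.sqrt(2 * np.pi))
    angular = np.exp(kappa * np.cos(phi - mu_phi)) / (2 * np.pi * np.i0(kappa))
    return radial * angular

def update_belief(prior, likelihood):
    """Stand-in for the learned incremental update: pointwise product
    of the previous belief and the new density, renormalized on the grid."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Discretize a 2 x 2 workspace into a 41 x 41 grid of candidate locations.
xs = np.linspace(-1.0, 1.0, 41)
gx, gy = np.meshgrid(xs, xs)
points = np.column_stack([gx.ravel(), gy.ravel()])

# Start from a uniform belief over the workspace.
belief = np.full(len(points), 1.0 / len(points))

# Hypothetical expression 1: "to the right of object A", with A at (0, 0).
belief = update_belief(
    belief, polar_density(points, (0.0, 0.0), mu_r=0.3, sigma_r=0.1, mu_phi=0.0, kappa=4.0)
)
best_after_first = points[np.argmax(belief)]

# Hypothetical expression 2: "below object B", with B at (0.5, 0.5).
belief = update_belief(
    belief, polar_density(points, (0.5, 0.5), mu_r=0.3, sigma_r=0.1, mu_phi=-np.pi / 2, kappa=4.0)
)
best_after_second = points[np.argmax(belief)]
```

Each additional expression multiplies in another constraint, so the belief concentrates on locations that satisfy all expressions so far. This mirrors the narrowing behavior the figures describe, without reproducing the paper's network.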