Where Do We Go from Here? Multi-scale Allocentric Relational Inference from Natural Spatial Descriptions

Tzuf Paz-Argaman, Sayali Kulkarni, John Palowitch, Jason Baldridge, Reut Tsarfaty

TL;DR

The paper introduces the Rendezvous (RVS) task and a 10,404-example dataset of survey-knowledge geospatial instructions, designed to require allocentric, non-sequential reasoning over a dense urban map. It builds a map-grounded framework using an OpenStreetMap-derived graph and S2-Cell coordinates, and proposes an encoder-decoder model (T5-based) with a graph-grounded world representation (T5+Graph) evaluated under a challenging zero-shot city split. Results show a substantial gap between current models and human performance, especially in unseen environments, underscoring the need for spatially aware language models and multimodal grounding. The work highlights directions for future research, including spatial large language models, integrating visual cues, and developing richer groundings to improve generalization in geospatial instruction understanding.

Abstract

When communicating routes in natural language, the concept of acquired spatial knowledge is crucial for geographic information retrieval (GIR) and in spatial cognitive research. However, NLP navigation studies often overlook the impact of such acquired knowledge on textual descriptions. Current navigation studies concentrate on egocentric local descriptions (e.g., 'it will be on your right') that require reasoning over the agent's local perception. These instructions are typically given as a sequence of steps, with each action step explicitly mentioning, and being followed by, a landmark that the agent can use to verify they are on the right path (e.g., 'turn right and then you will see...'). In contrast, descriptions based on knowledge acquired through a map provide a complete view of the environment and capture its overall structure. These instructions (e.g., 'it is south of Central Park and a block north of a police station') are typically non-sequential and contain multiple allocentric spatial relations and implicit actions, without any explicit verification. This paper introduces the Rendezvous (RVS) task and dataset, which includes 10,404 examples of English geospatial instructions for reaching a target location using map knowledge. Our analysis reveals that RVS exhibits a richer use of allocentric spatial relations and requires resolving more spatial relations simultaneously than previous text-based navigation benchmarks.
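To make "resolving more spatial relations simultaneously" concrete, here is a minimal Python sketch of how two allocentric constraints from the abstract's example instruction jointly pin down a goal. It is an illustration only, not the paper's method: the flat block grid, the `Point` class, the helper names (`south_of`, `blocks_north_of`, `resolve`), the tolerance, and all coordinates are hypothetical assumptions, whereas the paper grounds instructions in an OpenStreetMap-derived graph with S2-Cell coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Toy map position measured in city blocks (hypothetical grid, not S2 cells)."""
    x: float  # east-west offset, east is positive
    y: float  # north-south offset, north is positive


def south_of(candidate: Point, landmark: Point) -> bool:
    """True if the candidate lies anywhere south of the landmark."""
    return candidate.y < landmark.y


def blocks_north_of(candidate: Point, landmark: Point, blocks: float, tol: float = 0.5) -> bool:
    """True if the candidate is roughly `blocks` north of the landmark on the same street."""
    right_distance = abs((candidate.y - landmark.y) - blocks) <= tol
    same_street = abs(candidate.x - landmark.x) <= tol
    return right_distance and same_street


def resolve(candidates, constraints):
    """Return the names of candidates that satisfy every constraint at once."""
    return sorted(
        name for name, point in candidates.items()
        if all(check(point) for check in constraints)
    )


# Hypothetical landmarks and candidate goals, invented for this sketch.
park = Point(0, 10)
police_station = Point(2, 3)
candidates = {
    "cafe": Point(2, 4),
    "library": Point(2, 12),
    "bakery": Point(7, 4),
}

# "It is south of Central Park and a block north of a police station."
only_south = resolve(candidates, [lambda p: south_of(p, park)])
matches = resolve(
    candidates,
    [
        lambda p: south_of(p, park),
        lambda p: blocks_north_of(p, police_station, blocks=1),
    ],
)
print(only_south)  # more than one candidate survives a single relation
print(matches)     # the conjunction narrows the set
```

Neither relation has to be followed as a step: both are checked against the whole map at once, which is the non-sequential character the abstract attributes to map-based descriptions. A single relation leaves several candidates in this toy map, and only the conjunction isolates one.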

Paper Structure

This paper contains 36 sections, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: An illustration example from the RVS dataset. The RVS input consists of (1) a bird's-eye instruction of the goal location (shown at the bottom), (2) a starting point (green marker), and (3) a map representation of the environment. The output is the goal (red marker).
  • Figure 2: The RVS instructions are collected over three cities (a–c).
  • Figure 3: The RVS model based on a T5 transformer and a graph representation of the environment.
  • Figure 4: Example of Multiple Validations: (i) starting point (green marker), (ii) goal (red marker), and (iii) predicted goal by participants (black markers).
  • Figure 5: Participant Interface: the instruction writing task.
  • ...and 1 more figure