Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments

Saeideh Yousefzadeh, Hamidreza Pourreza

TL;DR

This work tackles Visual Place Recognition under changing appearance by proposing Text2Graph VPR, which converts image sequences into textual descriptions, parses them into scene graphs, and retrieves places using a hybrid of learned Graph Attention Network embeddings and Shortest-Path kernel-based structural similarity. The approach yields human-readable intermediate representations, supports zero-shot language-based localization, and demonstrates robustness across cross-condition and cross-city scenarios on Oxford RobotCar and MSLS. Although it may not always exceed pixel-based baselines in raw accuracy, its semantic-graph reasoning offers strong generalization, interpretability, and efficiency, making it well-suited for safety-critical and resource-constrained deployments. The work lays a foundation for knowledge-driven, multimodal localization, with promising directions toward integrating larger vision-language models and end-to-end training.

Abstract

Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison and, critically, produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
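
The dual-similarity retrieval described in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation. The graph encoding (a dict of node labels plus an undirected edge list), the function names, and the fixed weight `alpha` are assumptions made for illustration; the paper describes a dynamic fusion weight whose rule is not given here, and the embeddings would come from the trained GAT rather than being written by hand.

```python
import math
from collections import Counter, deque


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sp_features(labels, edges):
    """Histogram of (label, label, shortest-path length) over connected node pairs.

    labels: dict mapping node id -> object label (assumed encoding).
    edges:  iterable of undirected (node id, node id) pairs.
    """
    adj = {n: set() for n in labels}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    feats = Counter()
    for src in labels:
        # Breadth-first search gives shortest hop counts in an unweighted graph.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for dst, d in dist.items():
            if src < dst:  # count each unordered pair once
                la, lb = sorted((labels[src], labels[dst]))
                feats[(la, lb, d)] += 1
    return feats


def sp_kernel(g1, g2):
    """Normalized shortest-path kernel: dot product of path histograms, in [0, 1]."""
    f1, f2 = sp_features(*g1), sp_features(*g2)

    def raw(a, b):
        return sum(count * b[key] for key, count in a.items())

    denom = math.sqrt(raw(f1, f1) * raw(f2, f2))
    return raw(f1, f2) / denom if denom else 0.0


def fused_score(emb_q, emb_d, g_q, g_d, alpha=0.5):
    """Weighted sum of semantic and structural similarity (alpha is an assumed constant)."""
    return alpha * cosine_similarity(emb_q, emb_d) + (1 - alpha) * sp_kernel(g_q, g_d)


# Toy query and database graphs with hand-written embeddings.
query = ({"a": "building", "b": "tree", "c": "car"}, [("a", "b"), ("b", "c")])
other = ({"x": "building", "y": "tree"}, [("x", "y")])
print(fused_score([1.0, 0.0], [1.0, 0.0], query, query))
print(fused_score([1.0, 0.0], [0.6, 0.8], query, other))
```

Normalizing the kernel keeps both terms on a comparable scale, so a single weight can trade semantic agreement against structural agreement.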

Paper Structure

This paper contains 28 sections, 4 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of the proposed text-to-graph Visual Place Recognition framework. The pipeline consists of four main stages, shown from left to right: (1) A vision–language model generates stable textual descriptions from sequential frames. (2) A large language model parses these descriptions into frame-level scene graphs, which are subsequently merged into a unified place-level graph. (3) The scene graphs are encoded into graph embeddings using a Graph Attention Network (GAT) with BERT-based node and edge features, trained using an InfoNCE contrastive loss on anchor–positive pairs. (4) During retrieval, the query graph is compared with database graphs using a dual-similarity strategy: semantic similarity via cosine distance between learned embeddings, and structural similarity via the Shortest-Path (SP) kernel. A dynamic fusion weight balances the two signals to produce the final retrieval score.
  • Figure 2: Example of the scene graph construction pipeline. (a) Input image from an urban environment. (b) Generated textual description containing object-centric and spatial details. (c) JSON-formatted scene graph produced by the prompt-guided parsing process. (d) Visualization of the resulting scene graph showing object nodes and spatial relationships.
  • Figure 3: Structural analysis of scene graphs in Oxford RobotCar and MSLS. From left to right: (a) Node count vs. density shows smaller, denser graphs in Oxford and larger, sparser graphs in MSLS. (b) MSLS graphs contain more nodes. (c) MSLS graphs have slightly higher average degree. (d) The density distribution further confirms Oxford’s higher compactness and MSLS’s sparsity.
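
The frame-to-place merging mentioned in Figure 1 and the graph statistics plotted in Figure 3 can be sketched as follows. The JSON schema (`objects` and `relations` keys, with relations as subject/predicate/object triples), the union-based merge rule, and the undirected definition of density are all assumptions for illustration; the paper's actual schema and merge procedure may differ.

```python
def merge_frame_graphs(frame_graphs):
    """Union the nodes and relation triples of per-frame graphs (assumed merge rule)."""
    nodes, relations = set(), set()
    for graph in frame_graphs:
        nodes.update(graph["objects"])
        for subj, pred, obj in graph["relations"]:
            nodes.update((subj, obj))
            relations.add((subj, pred, obj))
    return {"objects": sorted(nodes), "relations": sorted(relations)}


def graph_stats(graph):
    """Node count, undirected density, and average degree of a scene graph."""
    nodes = set(graph["objects"])
    pairs = set()
    for subj, _, obj in graph["relations"]:
        nodes.update((subj, obj))
        if subj != obj:  # ignore self-loops
            # Several relations between the same two objects count as one edge.
            pairs.add(frozenset((subj, obj)))
    n, m = len(nodes), len(pairs)
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    avg_degree = 2 * m / n if n else 0.0
    return {"nodes": n, "density": density, "avg_degree": avg_degree}


# Two hypothetical frame-level graphs of the same place.
frame_1 = {"objects": ["building", "tree"],
           "relations": [["tree", "in front of", "building"]]}
frame_2 = {"objects": ["building", "car"],
           "relations": [["car", "next to", "building"]]}

place = merge_frame_graphs([frame_1, frame_2])
print(place["objects"])
print(graph_stats(place))
```

Under these definitions density falls as node count grows unless edges grow quadratically, which is consistent with the pattern in Figure 3: the larger MSLS graphs can have a slightly higher average degree and still be sparser than the smaller Oxford graphs.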