Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments
Saeideh Yousefzadeh, Hamidreza Pourreza
TL;DR
This work tackles Visual Place Recognition under changing appearance by proposing Text2Graph VPR, which converts image sequences into textual descriptions, parses them into scene graphs, and retrieves places using a hybrid of learned Graph Attention Network embeddings and Shortest-Path kernel-based structural similarity. The approach yields human-readable intermediate representations, supports zero-shot language-based localization, and demonstrates robustness across cross-condition and cross-city scenarios on Oxford RobotCar and MSLS. Although it may not always exceed pixel-based baselines in raw accuracy, its semantic-graph reasoning offers strong generalization, interpretability, and efficiency, making it well-suited for safety-critical and resource-constrained deployments. The work lays a foundation for knowledge-driven, multimodal localization, with promising directions toward integrating larger vision-language models and end-to-end training.
Abstract
Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison and, critically, produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on the Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
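To make the dual-similarity idea concrete, the following is a minimal Python sketch, not the authors' implementation. It computes a simple normalized shortest-path kernel over small labeled scene graphs and fuses it with the cosine similarity of two embedding vectors. The embedding vectors stand in for GAT outputs, which are not computed here. The weighted-sum fusion rule, the `alpha` parameter, the graph encoding (a label dictionary plus an edge list), and all function names are assumptions made for illustration; the abstract does not specify how the two scores are combined.

```python
import math
from collections import Counter
from itertools import combinations


def shortest_paths(nodes, edges):
    """All-pairs shortest-path lengths on an unweighted, undirected graph (Floyd-Warshall)."""
    dist = {u: {v: (0 if u == v else math.inf) for v in nodes} for u in nodes}
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def sp_features(labels, edges):
    """Histogram of (label_a, label_b, path_length) triples over connected node pairs."""
    nodes = list(labels)
    dist = shortest_paths(nodes, edges)
    feats = Counter()
    for u, v in combinations(nodes, 2):
        d = dist[u][v]
        if d < math.inf:
            a, b = sorted((labels[u], labels[v]))
            feats[(a, b, d)] += 1
    return feats


def sp_kernel(g1, g2):
    """Normalized shortest-path kernel: cosine of the two feature histograms."""
    f1, f2 = sp_features(*g1), sp_features(*g2)
    dot = sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())
    norm = math.sqrt(sum(v * v for v in f1.values()) * sum(v * v for v in f2.values()))
    return dot / norm if norm else 0.0


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fused_similarity(emb_q, emb_r, g_q, g_r, alpha=0.5):
    """Assumed fusion rule: convex combination of embedding and structural similarity."""
    return alpha * cosine(emb_q, emb_r) + (1 - alpha) * sp_kernel(g_q, g_r)


# Hypothetical scene graphs: (node_id -> object label, list of relation edges).
g_day = ({0: "building", 1: "tree", 2: "road"}, [(0, 1), (1, 2)])
g_night = ({0: "building", 1: "tree", 2: "road", 3: "car"}, [(0, 1), (1, 2), (2, 3)])
g_other = ({0: "bridge", 1: "river"}, [(0, 1)])

# Placeholder vectors standing in for learned graph embeddings.
score_same_place = fused_similarity([1.0, 0.2], [0.9, 0.3], g_day, g_night)
score_other_place = fused_similarity([1.0, 0.2], [0.1, 1.0], g_day, g_other)
```

In this toy setup, the night-time graph of the same place shares three of its six shortest-path features with the daytime graph, so the structural term stays well above zero even though a transient object (`car`) was added, whereas the unrelated graph shares none. That is the kind of topology-aware robustness the SP kernel is meant to contribute alongside the learned embeddings.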
