Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks
Jiaxin Zhang, Zehong Zhu, Junye Deng, Yunqin Li, and Bowen Wang
TL;DR
This work addresses the erosion of traditional village spatial morphology under urbanization by proposing a hierarchical graph neural network (HGNN) that fuses multi-source data (images, text, and independent socio-geographic factors) to classify 17 village subtypes across three groups: Settlement Landscape Spatial Structure (S), Settlement and Parcel Morphology Patterns (V), and Settlement Road Network Patterns (R). The pipeline uses frozen CLIP encoders for image and text features, a learnable expansion layer for multi-fact data, and a two-stage HGNN (GCN followed by GAT) with static input edges and dynamic communication edges. It also incorporates a fusion weight $\beta=0.6$ and a relational pooling mechanism for joint subtype training. Key contributions include a multi-source village morphology dataset (583 villages in Jiangxi), a hierarchical graph-based fusion framework, and a joint training strategy across 17 subtypes that improves robustness on small datasets. The approach achieves state-of-the-art performance across the S, V, and R groups, with an average accuracy of around $0.85$, and provides interpretable attention patterns that reveal how image, text, and socio-geographic signals interact to shape morphology. Overall, the method offers a scalable, data-driven tool for mapping village spatial patterns, supporting heritage conservation, village planning, and revitalization efforts in the context of digital villages and smart planning initiatives.
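The summary above reports a fusion weight $\beta=0.6$ but does not state how it enters the model. As a minimal sketch, assuming the weight forms a convex combination of two equally sized feature vectors, the blend could look like the following. The function name `fuse`, the two toy vectors, and the convex-combination form itself are illustrative assumptions, not the authors' implementation.

```python
from typing import List

BETA = 0.6  # fusion weight reported in the text; its exact role is assumed here


def fuse(primary: List[float], secondary: List[float], beta: float = BETA) -> List[float]:
    """Blend two feature vectors as beta * primary + (1 - beta) * secondary.

    The convex-combination form is an assumption made for illustration.
    """
    if len(primary) != len(secondary):
        raise ValueError("feature vectors must have the same length")
    return [beta * p + (1.0 - beta) * s for p, s in zip(primary, secondary)]


# Hypothetical toy features standing in for two modality embeddings.
image_feat = [1.0, 0.0, 2.0]
text_feat = [0.0, 1.0, 2.0]
fused = fuse(image_feat, text_feat)
```

With $\beta > 0.5$ the first argument dominates the blend; setting $\beta = 1$ or $\beta = 0$ recovers one input unchanged, which makes the weight easy to ablate.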
Abstract
Village areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze village spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address these limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of village spatial morphology. The framework includes two types of nodes (input nodes and communication nodes) and two types of edges (static input edges and dynamic communication edges). By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying village spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. In particular, the proposed joint optimization of all subtypes lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. Our method provides scientific evidence for exploring village spatial patterns and generative logic.
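The two-stage feature update described above can be illustrated with a small, dependency-free sketch. It assumes that stage one applies a GCN-style normalised aggregation over the static input edges and that stage two applies GAT-style attention over the communication edges. The toy graph, the node roles, and the fixed attention vectors `a_src` and `a_dst` are hypothetical, and learnable weight matrices are omitted for brevity, so this is not the authors' architecture.

```python
import math
from typing import Dict, List, Tuple

Vec = List[float]
Edge = Tuple[int, int]


def neighbours(n: int, edges: List[Edge]) -> Dict[int, List[int]]:
    """Undirected adjacency lists with a self-loop on every node."""
    nbrs = {i: {i} for i in range(n)}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return {i: sorted(s) for i, s in nbrs.items()}


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def gcn_step(h: List[Vec], edges: List[Edge]) -> List[Vec]:
    """Stage 1: GCN propagation D^-1/2 (A + I) D^-1/2 H, without learned weights."""
    nbrs = neighbours(len(h), edges)
    deg = {i: len(js) for i, js in nbrs.items()}
    out = []
    for i, js in nbrs.items():
        acc = [0.0] * len(h[i])
        for j in js:
            w = 1.0 / math.sqrt(deg[i] * deg[j])
            acc = [a + w * x for a, x in zip(acc, h[j])]
        out.append(acc)
    return out


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_step(h: List[Vec], edges: List[Edge], a_src: Vec, a_dst: Vec):
    """Stage 2: GAT-style update; returns new features and attention weights."""
    nbrs = neighbours(len(h), edges)
    out, attn = [], {}
    for i, js in nbrs.items():
        scores = [leaky_relu(dot(a_src, h[i]) + dot(a_dst, h[j])) for j in js]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attn[i] = dict(zip(js, alphas))
        out.append([sum(al * h[j][k] for al, j in zip(alphas, js))
                    for k in range(len(h[i]))])
    return out, attn


# Hypothetical toy graph: nodes 0-2 are input nodes, nodes 3-4 communication nodes.
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
static_edges = [(0, 3), (1, 3), (1, 4), (2, 4)]  # assumed fixed wiring
comm_edges = [(3, 4)]                            # weighted by attention in stage 2

h1 = gcn_step(h0, static_edges)
h2, attn = gat_step(h1, comm_edges, a_src=[0.5, -0.5], a_dst=[1.0, 1.0])
```

In this sketch the communication edges are "dynamic" only in the sense that their weights come from the attention scores, which depend on the current node features. The per-node attention dictionaries in `attn` are the kind of quantity one would inspect when interpreting how modalities interact.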
