
Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks

Jiaxin Zhang, Zehong Zhu, Junye Deng, Yunqin Li, and Bowen Wang

TL;DR

This work addresses the erosion of traditional village spatial morphology under urbanization. It proposes a hierarchical graph neural network (HGNN) that fuses multi-source data (images, text, and independent socio-geographic factors) to classify 17 village subtypes in three groups: Settlement Landscape Spatial Structure (S), Settlement and Parcel Morphology Patterns (V), and Settlement Road Network Patterns (R). The pipeline uses frozen CLIP encoders for image and text features, a learnable expansion layer for multi-fact data, and a two-stage HGNN (GCN followed by GAT) with static input edges and dynamic communication edges. It also incorporates a fusion weight $\beta=0.6$ and a relation pooling mechanism for joint subtype training. Key contributions include a multi-source village morphology dataset (583 villages in Jiangxi), a hierarchical graph-based fusion framework, and a joint training strategy across 17 subtypes that improves robustness on small datasets. The approach achieves state-of-the-art performance across the S, V, and R groups, with an average accuracy around $0.85$, and yields interpretable attention patterns that reveal how image, text, and socio-geographic signals interact to shape morphology. Overall, the method offers a scalable, data-driven tool for mapping village spatial patterns, supporting heritage conservation, village planning, and revitalization efforts in the context of digital villages and smart planning initiatives.
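
As a rough illustration of the "learnable expansion layer" mentioned above, the sketch below lifts each scalar socio-geographic factor to a vector of the same width as the image and text embeddings, so that all inputs can serve as graph node features. The class name `FeatureExpansion`, the per-factor linear form, the embedding width of 512 (the output size of CLIP ViT-B/32; the summary does not say which CLIP variant is used), and the factor values are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed width: 512 is the CLIP ViT-B/32 embedding size; the source does not state it.
EMBED_DIM = 512


class FeatureExpansion:
    """Hypothetical expansion layer: one learnable linear map per scalar factor."""

    def __init__(self, num_factors, embed_dim):
        # One weight vector and one bias vector per factor (trainable in practice).
        self.weight = rng.normal(0.0, 0.02, size=(num_factors, embed_dim))
        self.bias = np.zeros((num_factors, embed_dim))

    def __call__(self, factors):
        # factors: (num_factors,) -> (num_factors, embed_dim)
        return factors[:, None] * self.weight + self.bias


# Made-up, standardized factor values for a single village.
factors = np.array([0.3, 1.2, -0.7])
expand = FeatureExpansion(num_factors=3, embed_dim=EMBED_DIM)
factor_nodes = expand(factors)

# Random stand-ins for the frozen CLIP image and text embeddings.
image_feat = rng.normal(size=(1, EMBED_DIM))
text_feat = rng.normal(size=(1, EMBED_DIM))

# All modalities now share one width and can be stacked as input-node features.
input_nodes = np.concatenate([image_feat, text_feat, factor_nodes], axis=0)
print(input_nodes.shape)  # (5, 512)
```

Giving each factor its own node, rather than concatenating all factors into one vector, is one way to let later graph layers weight factors individually. Whether the paper does this is not stated in the summary.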

Abstract

Village areas hold significant importance in the study of human-land relationships. However, as urbanization advances, the gradual disappearance of spatial characteristics and the homogenization of landscapes have become prominent issues. Existing studies primarily adopt a single-discipline perspective to analyze village spatial morphology and its influencing factors, relying heavily on qualitative methods. These efforts are often constrained by a lack of digital infrastructure and insufficient data. To address these limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to analyze village spatial morphology in depth. The framework includes two types of nodes (input nodes and communication nodes) and two types of edges (static input edges and dynamic communication edges). By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Based on existing principles for classifying village spatial morphology, the paper also introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results show that the method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. In addition, jointly optimizing all subtypes lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain for parcel tasks. The method provides scientific evidence for exploring village spatial patterns and their generative logic.
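
The two-stage update described in the abstract can be pictured with a minimal NumPy sketch: a GCN-style propagation over fixed (static) edges from input nodes to communication nodes, followed by a single-head GAT-style attention step among the communication nodes, whose data-dependent attention weights play the role of dynamic edges. The toy graph, the dimensions, the random weights, and in particular the use of $\beta=0.6$ as a convex blend between the stage-two and stage-one features are assumptions for illustration; the summary does not specify how the fusion weight enters the model.

```python
import numpy as np

rng = np.random.default_rng(0)
BETA = 0.6  # fusion weight from the TL;DR; its role as a blend factor is an assumption


def gcn_layer(x, adj, weight):
    """GCN propagation: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ weight, 0.0)


def gat_layer(h, mask, weight, attn_vec, slope=0.2):
    """Single-head graph attention; mask[i, j] is True where j is a neighbour of i."""
    z = h @ weight
    d = z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), computed as a sum of two dot products.
    e = (z @ attn_vec[:d])[:, None] + (z @ attn_vec[d:])[None, :]
    e = np.where(e > 0, e, slope * e)
    e = np.where(mask, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)  # stabilise the softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha


# Toy graph: 5 input nodes (image, text, 3 factors) and 2 communication nodes.
n_in, n_comm, dim = 5, 2, 8
x = rng.normal(size=(n_in + n_comm, dim))
x[n_in:] = 0.0  # communication nodes start empty in this toy setup

# Static input edges (hypothetical): image and text feed both communication nodes,
# while each factor feeds only one of them.
adj = np.zeros((n_in + n_comm, n_in + n_comm))
static_edges = [(0, 5), (1, 5), (2, 5), (3, 5), (0, 6), (1, 6), (4, 6)]
for src, dst in static_edges:
    adj[src, dst] = adj[dst, src] = 1.0

# Stage 1: GCN over the static edges.
w_gcn = rng.normal(0.0, 0.3, size=(dim, dim))
h = gcn_layer(x, adj, w_gcn)
comm = h[n_in:]

# Stage 2: attention among communication nodes (fully connected, with self-loops).
w_gat = rng.normal(0.0, 0.3, size=(dim, dim))
attn_vec = rng.normal(0.0, 0.3, size=2 * dim)
mask = np.ones((n_comm, n_comm), dtype=bool)
updated, alpha = gat_layer(comm, mask, w_gat, attn_vec)

# Assumed fusion: blend attention output with the stage-1 features.
fused = BETA * updated + (1.0 - BETA) * comm
print(fused.shape)
```

In this sketch each row of `alpha` sums to one, so it can be read as how strongly one communication node attends to the others. This is the kind of quantity an attention-based interpretability analysis would inspect.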

Paper Structure

This paper contains 15 sections, 13 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Multi-source data used in this research.
  • Figure 2: The study area in this research.
  • Figure 3: Principles relating multi-fact data to the prediction of a subtype. Note that image and text data are not associated with any particular subtype.
  • Figure 4: Pipeline of proposed method. It uses a frozen image encoder and a text encoder to extract features from images and text. A learnable feature expansion module aligns the dimensions of multi-fact data with those of the image and text. A Graph Neural Network (GNN) propagates information across data types in a hierarchical structure. Finally, a relation pooling operation is applied, followed by a linear layer to predict all spatial morphology subtypes.
  • Figure 5: Overall accuracy across all subtypes of village spatial morphology. Results for S types, V types, and R types are shown in gradient red, grey, and gold, respectively. The average (Avg) accuracy over all subtypes is shown in light purple. Dotted lines indicate the best- and worst-predicted types.
  • ...and 4 more figures