Tighnari: Multi-modal Plant Species Prediction Based on Hierarchical Cross-Attention Using Graph-Based and Vision Backbone-Extracted Features
Haixu Liu, Penghao Jiang, Zerui Tao, Muyan Wan, Qiuzhuang Sun
TL;DR
Tighnari addresses multi-modal plant species prediction in spatiotemporal contexts by fusing graph-based features, temporal cubes processed with a Swin-Transformer backbone, and image and tabular data through a Hierarchical Cross-Attention Mechanism (HCAM). The framework introduces a Graph Feature Vector (GFV) built from a Survey ID graph, uses HCAM to fuse the modalities progressively with density-aware weighting, and combines MixUp, ten-fold cross-fusion training, and Threshold Top-K post-processing to handle extreme label sparsity. Ablation studies show that GFV and HCAM lower validation loss, and that ten-fold cross-fusion together with post-processing substantially improves leaderboard performance, reaching a private score of about 0.365 on GeoLifeCLEF 2024. The work advances practical multi-modal fusion for biodiversity monitoring, offering scalable strategies for imbalanced, high-dimensional ecological data and a blueprint for future weakly supervised and graph-based approaches.
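To make the MixUp step concrete, the following is a minimal sketch of generic MixUp applied to one pair of samples with multi-hot species labels. It is not the authors' implementation: the function name `mixup`, the default `alpha=0.4`, and the plain-list data layout are assumptions chosen for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.4, rng=random):
    """Blend two (features, multi-hot labels) samples with a Beta-distributed weight.

    alpha is a hypothetical default; the source does not state the value used.
    """
    # Draw the mixing coefficient lam in [0, 1] from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the feature vectors.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    # The same convex combination of the label vectors yields soft targets.
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Usage: mix two toy surveys with three candidate species each.
rng = random.Random(0)
x_mix, y_mix, lam = mixup([1.0, 0.0], [1, 0, 0], [0.0, 1.0], [0, 1, 0], rng=rng)
```

Because the labels are mixed with the same coefficient as the features, the soft targets stay in [0, 1] and can be used with a binary cross-entropy style loss on sparse multi-label data.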
Abstract
Predicting plant species composition in specific spatiotemporal contexts plays an important role in biodiversity management and conservation, as well as in improving species identification tools. Our work uses records from 88,987 plant surveys conducted in specific spatiotemporal contexts across Europe. We train the model on these records together with the corresponding satellite images, time series data, climate time series, and other rasterized environmental data such as land cover, human footprint, bioclimatic, and soil variables, and use it to predict the outcomes of 4,716 plant surveys. We propose a feature construction and result correction method based on a graph structure. Through comparative experiments, we select the best-performing backbone networks for feature extraction in both the temporal and image modalities. As part of this process, we build a backbone network based on the Swin-Transformer block to extract features from temporal cubes. We then design a hierarchical cross-attention mechanism capable of robustly fusing features from multiple modalities. During training, we adopt a 10-fold cross-fusion method based on fine-tuning, and we use a Threshold Top-K method for post-processing. Ablation experiments demonstrate the performance improvements brought by each component of the proposed pipeline.
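One plausible reading of Threshold Top-K post-processing can be sketched as follows. The abstract does not specify the rule, so this is an assumption: keep every species whose predicted probability reaches a threshold, cap the list at `k_max`, and fall back to the `k_min` highest-scoring species when too few pass. The function name and all parameter names and defaults are hypothetical.

```python
def threshold_top_k(probs, threshold=0.5, k_min=1, k_max=None):
    """Select species indices from per-species probabilities.

    Assumed rule (not taken from the paper): threshold first, then enforce
    a minimum and an optional maximum number of predicted species.
    """
    # Rank species indices by descending probability (ties keep index order).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Optionally cap the number of candidates at k_max.
    if k_max is not None:
        ranked = ranked[:k_max]
    # Keep candidates that reach the threshold.
    selected = [i for i in ranked if probs[i] >= threshold]
    # Fall back to the top k_min candidates if too few survive.
    if len(selected) < k_min:
        selected = ranked[:k_min]
    return selected


probs = [0.9, 0.1, 0.6, 0.4]
print(threshold_top_k(probs, threshold=0.5))            # [0, 2]
print(threshold_top_k(probs, threshold=0.95, k_min=2))  # [0, 2]
print(threshold_top_k(probs, threshold=0.3, k_max=1))   # [0]
```

The fallback guarantees that every survey receives at least one predicted species even when all probabilities are low, which matters when labels are extremely sparse.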
