Tighnari: Multi-modal Plant Species Prediction Based on Hierarchical Cross-Attention Using Graph-Based and Vision Backbone-Extracted Features

Haixu Liu; Penghao Jiang; Zerui Tao; Muyan Wan; Qiuzhuang Sun

TL;DR

Tighnari addresses multi-modal plant species prediction in spatiotemporal contexts by fusing graph-based features, temporal cube features extracted with a Swin-Transformer backbone, and image and tabular data through a Hierarchical Cross-Attention Fusion Mechanism (HCAM). The framework introduces a Graph Feature Vector (GFV) built from a Survey ID graph, uses HCAM to fuse the modalities progressively with density-aware weighting, and combines MixUp with ten-fold cross-fusion training and Threshold Top-K post-processing to handle extreme label sparsity. Ablation studies show that GFV and HCAM reduce validation loss, and that ten-fold cross fusion plus post-processing substantially improves leaderboard performance, reaching a private leaderboard score of about 0.365 on GeoLifeCLEF 2024. The work offers practical multi-modal fusion strategies for biodiversity monitoring that scale to imbalanced, high-dimensional ecological data, and a blueprint for future weakly supervised and graph-based approaches.
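As a rough illustration of the MixUp augmentation named in the summary, the sketch below blends two samples and their multi-hot species labels using a coefficient drawn from a Beta distribution. The function name `mixup_pair`, the value `alpha=0.4`, and the flat list representation are assumptions made for illustration; this summary does not state the settings the authors used.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.4, rng=None):
    """Blend two samples and their multi-hot label vectors (generic MixUp).

    alpha=0.4 is an illustrative default, not a value taken from the paper.
    """
    rng = rng or random.Random()
    # Mixing coefficient lam ~ Beta(alpha, alpha), always within [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the inputs and of the label vectors.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


if __name__ == "__main__":
    features, labels, lam = mixup_pair(
        [1.0, 0.0], [1, 0, 0],   # sample A and its multi-hot species labels
        [0.0, 1.0], [0, 1, 0],   # sample B and its multi-hot species labels
        rng=random.Random(0),
    )
    print(lam, features, labels)
```

Mixing the label vectors along with the inputs produces soft targets, which is a common way to regularize training when most of the label vector is zero.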

Abstract

Predicting plant species composition in specific spatiotemporal contexts plays an important role in biodiversity management and conservation, as well as in improving species identification tools. Our work utilizes 88,987 plant survey records conducted in specific spatiotemporal contexts across Europe. We also use the corresponding satellite images, time series data, climate time series, and other rasterized environmental data such as land cover, human footprint, bioclimatic, and soil variables as training data, and train the model to predict the outcomes of 4,716 plant surveys. We propose a feature construction and result correction method based on the graph structure. Through comparative experiments, we select the best-performing backbone networks for feature extraction in both the temporal and image modalities. In this process, we build a backbone network based on the Swin-Transformer Block for extracting temporal cube features. We then design a hierarchical cross-attention mechanism capable of robustly fusing features from multiple modalities. During training, we adopt a 10-fold cross-fusion method based on fine-tuning and use a Threshold Top-K method for post-processing. Ablation experiments demonstrate the improvements in model performance brought by our proposed solution pipeline.
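The abstract names a Threshold Top-K post-processing step without defining it here. One plausible reading, sketched below purely as an assumption, is to keep every species whose predicted probability clears a threshold while clamping the number of kept species between a minimum and a maximum. The function name `threshold_top_k` and the parameters `threshold`, `k_min`, and `k_max` are hypothetical and may differ from the rule used in the paper.

```python
def threshold_top_k(probs, threshold=0.5, k_min=1, k_max=3):
    """Return indices of predicted species, ordered by descending probability.

    Assumed rule (not confirmed by the source): keep species with
    probability >= threshold, but never fewer than k_min or more than k_max.
    """
    # Species indices sorted from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Count how many species clear the threshold.
    n_above = sum(1 for p in probs if p >= threshold)
    # Clamp the number of kept species to [k_min, k_max].
    n_keep = min(max(n_above, k_min), k_max)
    return ranked[:n_keep]


if __name__ == "__main__":
    survey_probs = [0.1, 0.7, 0.05, 0.6, 0.55, 0.9]
    print(threshold_top_k(survey_probs))
```

The lower bound guarantees at least one prediction for surveys where no species is confident, and the upper bound limits false positives when many species sit just above the threshold.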

Paper Structure (19 sections, 14 equations, 12 figures, 3 tables, 2 algorithms)

Introduction
Background and related literature
Our method
Contributions
Exploratory Data Analysis
Methodology
Table Data Cleaning and Missing Value Imputation
Graph Construction and Utilization
Temporal Feature Extraction
Image Feature Extraction
Hierarchical Cross-Attention Fusion Mechanism (HCAM)
MixUp + 10-Fold Cross-Fusion Training Strategy
Post-Processing: Threshold Top-K and Output Correction
Experiments
Comparative Experiments
...and 4 more sections

Figures (12)

Figure 1: Flattened visualization of time series cubes
Figure 2: Visual comparison between NIR image and RGB image
Figure 3: Comparison of numerical feature heatmaps in different regions
Figure 4: Comparison of numerical feature boxplots in different regions
Figure 5: Visualization of survey occurrence locations in the PO, PA, and test sets
...and 7 more figures
