FREE: The Foundational Semantic Recognition for Modeling Environmental Ecosystems
Shiyuan Luo, Juntong Ni, Shengyu Chen, Runlong Yu, Yiqun Xie, Licheng Liu, Zhenong Jin, Huaxiu Yao, Xiaowei Jia
TL;DR
This work tackles the challenge of modeling complex environmental ecosystems with heterogeneous and sparsely observed data by introducing FREE, an LLM-based framework that translates input features into natural language descriptions and performs prediction as semantic recognition in a text space. A simulation-based pre-training regime using physics-based models imbues the semantic encoder and temporal predictor with physically consistent semantics, enabling robust performance and faster fine-tuning across regions. Empirical evaluation on stream temperature prediction in the Delaware River Basin and corn yield prediction in Illinois and Iowa demonstrates superior accuracy relative to strong baselines, especially under data sparsity, and shows effective incorporation of auxiliary observations and transferability across regions. By leveraging large language models as foundational tools for environmental modeling, FREE points toward a scalable, global approach to modeling complex ecological systems with heterogeneous data sources.
Abstract
Modeling environmental ecosystems is critical for the sustainability of our planet, but is extremely challenging due to the complex underlying processes driven by interactions among a large number of physical variables. As many variables are difficult to measure at large scales, existing works often utilize a combination of observable features and locally available measurements or modeled values as input to build models for a specific study region and time period. This raises a fundamental question in advancing the modeling of environmental ecosystems: how can we build a general framework for modeling the complex relationships among diverse environmental variables over space and time? In this paper, we introduce a framework, FREE, that enables the use of varying features and available information to train a universal model. The core idea is to map available environmental data into a text space and then convert the traditional predictive modeling task in environmental science to a semantic recognition problem. Our evaluation on two societally important real-world applications, stream water temperature prediction and crop yield prediction, demonstrates the superiority of FREE over multiple baselines, even in data-sparse scenarios.
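To make the idea of mapping environmental data into a text space more concrete, the following is a minimal sketch of how a record with varying or missing features might be rendered as a natural language description. It is an illustration only, not the authors' implementation: the function name `describe_features`, the feature names, the units, and the sentence template are all assumptions introduced here.

```python
from typing import Mapping, Optional


def describe_features(
    features: Mapping[str, Optional[float]],
    units: Mapping[str, str],
) -> str:
    """Render the available features of one record as a single sentence.

    Hypothetical helper: the template and feature names are assumptions,
    not taken from the FREE paper. Features whose value is None are
    treated as unobserved and simply left out of the description.
    """
    parts = []
    for name, value in features.items():
        if value is None:
            continue  # unobserved feature: omit rather than impute
        unit = units.get(name, "")
        suffix = f" {unit}" if unit else ""
        readable_name = name.replace("_", " ")
        parts.append(f"{readable_name} is {value:g}{suffix}")
    if not parts:
        return "No features are available."
    return "On this day, " + ", ".join(parts) + "."


# Example record with one missing measurement (illustrative values only).
record = {
    "air_temperature": 12.5,
    "precipitation": None,
    "solar_radiation": 180.0,
}
record_units = {
    "air_temperature": "degrees Celsius",
    "solar_radiation": "W/m^2",
}
print(describe_features(record, record_units))
```

Because unobserved features are skipped instead of imputed, records with different feature sets all map into the same text space, which is the property the abstract relies on when it describes training a universal model from varying features.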
