Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning

Heesup Yun, Isaac Kazuo Uyehara, Earl Ranario, Lars Lundqvist, Christine H. Diepenbrock, Brian N. Bailey, J. Mason Earles

TL;DR

To the best of the authors' knowledge, this is the first study to use VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.

Abstract

This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs (Gemma 3 and Qwen3-VL) to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters such as plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or fall back on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models' reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to use VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
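
As a rough illustration of two ideas mentioned in the abstract, the sketch below shows what a JSON integrity check and a mean-guess baseline could look like. It is not the authors' evaluation code: the field names `plant_count` and `sun_azimuth_deg`, the helper names, and the use of mean absolute error are all assumptions made for this example.

```python
import json
from statistics import mean

# Hypothetical field names; the paper's actual configuration schema is not shown here.
REQUIRED_NUMERIC_KEYS = ("plant_count", "sun_azimuth_deg")


def check_json_integrity(raw: str) -> bool:
    """Return True if `raw` parses as a JSON object holding every required numeric field."""
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(cfg, dict):
        return False
    for key in REQUIRED_NUMERIC_KEYS:
        value = cfg.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
    return True


def mean_absolute_error(predictions, targets):
    """Average absolute difference between paired predictions and targets."""
    return mean([abs(p - t) for p, t in zip(predictions, targets)])


def mean_guess_error(reference_values, targets):
    """Error of a baseline that always predicts the mean of `reference_values`."""
    guess = mean(reference_values)
    return mean([abs(guess - t) for t in targets])
```

A model output that fails `check_json_integrity` cannot be rendered by the simulator at all, while a model whose error is no better than `mean_guess_error` is likely echoing the average of its in-context examples rather than reading the image.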

Paper Structure (16 sections, 1 equation, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of the data-driven synthetic data generation pipeline and real-to-sim evaluation framework. (1) Data-Driven Synthetic Dataset Generation: Spatial features and structural parameters were extracted from real-world field data to procedurally synthesize high-fidelity cowpea plant plots. (2) Sim-to-Real Evaluation: The vision language model (VLM) was evaluated via few-shot in-context learning.
  • Figure 2: Multi-model evaluation metric comparisons. Blue colors represent Gemma 3 (Gemma Team, 2025) models, orange colors represent Qwen3-VL (Bai et al., 2025) models, and green colors represent LoRA (Hu et al., 2021) fine-tuned Qwen3-VL models. Blue dotted lines represent mean guess baselines.
  • Figure 3: Effect of days after planting (DAP) in the synthetic dataset on evaluation metrics. Orange colors represent Qwen3-VL (Bai et al., 2025) models, and green colors represent LoRA (Hu et al., 2021) fine-tuned Qwen3-VL models. Blue dotted lines represent mean guess baselines.
  • Figure 4: Evaluations on the synthetic dataset, the real ortho dataset, and the blind baseline for the original and fine-tuned Qwen3-VL models. Orange colors represent Qwen3-VL (Bai et al., 2025) models, and green colors represent LoRA (Hu et al., 2021) fine-tuned Qwen3-VL models. Blue dotted lines represent mean guess baselines.
  • Figure 5: Examples of simulated cowpea plot generation results based on in-context learning methods. Real images were given to the Qwen3-VL 32B model to generate a cowpea plot simulation configuration, and the resulting configurations were rendered by the simulation program.