AGILE: A Diffusion-Based Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification

Earl Ranario, Lars Lundqvist, Heesup Yun, Brian N. Bailey, J. Mason Earles

TL;DR

AGILE tackles semantic drift in cross-domain plant trait image translation by aligning source and target semantics through optimized text embeddings and attention-guided denoising in pretrained diffusion models. It constructs semantic correspondences using query-attention maps from labeled source images and applies attention editing in the first few cross-attention layers, enabling controllable, object-centric translations without paired data. Empirical results on AgML-derived grape and flower datasets show improved object detection AP in the target domain and favorable realism metrics relative to baselines, with ablations confirming the necessity of both text optimization and attention guidance. The approach offers practical value for farming AI by enabling data-efficient transfer across domain gaps, with future work on multi-object guidance and robustness to perspective and lighting changes.

Abstract

Semantically consistent cross-domain image translation facilitates the generation of training data by transferring labels across different domains, making it particularly useful for plant trait identification in agriculture. However, existing generative models struggle to maintain object-level accuracy when translating images between domains, especially when domain gaps are significant. In this work, we introduce AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification), a diffusion-based framework that leverages optimized text embeddings and attention guidance to semantically constrain image translation. AGILE utilizes pretrained diffusion models and publicly available agricultural datasets to improve the fidelity of translated images while preserving critical object semantics. Our approach optimizes text embeddings to strengthen the correspondence between source and target images and guides attention maps during the denoising process to control object placement. We evaluate AGILE on cross-domain plant datasets and demonstrate its effectiveness in generating semantically accurate translated images. Quantitative experiments show that AGILE enhances object detection performance in the target domain while maintaining realism and consistency. Compared to prior image translation methods, AGILE achieves superior semantic alignment, particularly in challenging cases where objects vary significantly or domain gaps are substantial.
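To make the idea of guiding attention maps toward labeled object regions more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It computes a scaled dot-product cross-attention map between image queries and text-token keys, then scores how much of one token's attention falls inside a labeled object mask. The function names, the flat pixel layout, and the loss `mask_alignment_loss` are all assumptions introduced for illustration; the objective actually used by AGILE is not given in this summary.

```python
import numpy as np


def cross_attention_map(queries, keys):
    """Softmax attention of each image location over text tokens.

    queries: (num_pixels, d) image features (hypothetical layout).
    keys:    (num_tokens, d) projected text embeddings.
    Returns an array of shape (num_pixels, num_tokens); rows sum to 1.
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def mask_alignment_loss(attn, token_index, mask):
    """Illustrative loss (an assumption, not the paper's objective).

    Returns 1 minus the fraction of the chosen token's attention mass
    that lies inside the binary object mask of shape (num_pixels,).
    """
    token_attn = attn[:, token_index]
    inside = (token_attn * mask).sum()
    total = token_attn.sum() + 1e-8  # guard against division by zero
    return 1.0 - inside / total
```

In a setup like the one the abstract describes, a loss of this kind could in principle be minimized with respect to the text embedding so that the object token attends to the labeled region; how AGILE does this in practice should be taken from the paper itself.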

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: We use labels from synthetic data to gain object-level domain knowledge, represented by text labels, by optimizing for semantic correspondences between the source and target domains.
  • Figure 2: AGILE uses pretrained diffusion models to find semantic correspondences between unpaired source and target images. We optimize text embeddings through query attention maps generated from labeled source images, guiding the model to focus on desired regions in the target domain. Attention guidance is applied during the denoising process to enhance control over semantic alignment, achieving improved consistency in translation between source and target domains.
  • Figure 3: The dataset is pulled from AgML, a machine learning library for agricultural datasets. The original synthetic images were generated with Helios, a 3D plant and environment biophysical modeling framework [bailey_helios_2019, lei_simulation_2024]. Synthetic images are treated as the source domain because an unlimited number of labeled images can be generated. We train and evaluate our method on object detection tasks and constrain each translation to the same plant.
  • Figure 4: The displayed timesteps indicate when attention guidance is halted. The optimal stopping range is between $t=5$ and $t=15$, as this preserves object structure and color effectively.
  • Figure 5: Generation results across translation tasks for our method (AGILE), CropGAN, and CycleGAN-turbo. The Target column represents the desired output domain for each translation task. Top Row: Synthetic Grape to Borden Day. Middle Row: Synthetic Grape to Borden Night. Bottom Row: Synthetic Flower to Real Flower.
  • ...and 2 more figures