AGILE: A Diffusion-Based Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification
Earl Ranario, Lars Lundqvist, Heesup Yun, Brian N. Bailey, J. Mason Earles
TL;DR
AGILE tackles semantic drift in cross-domain plant trait image translation by aligning source and target semantics through optimized text embeddings and attention-guided denoising in pretrained diffusion models. It constructs semantic correspondences using query-attention maps from labeled source images and applies attention editing in the first few cross-attention layers, enabling controllable, object-centric translations without paired data. Empirical results on AgML-derived grape and flower datasets show improved object detection average precision (AP) in the target domain and favorable realism metrics relative to baselines, and ablations indicate that both text optimization and attention guidance are necessary. The approach offers practical value for agricultural AI by enabling data-efficient transfer across domain gaps; future work targets multi-object guidance and robustness to perspective and lighting changes.
Abstract
Semantically consistent cross-domain image translation facilitates the generation of training data by transferring labels across different domains, making it particularly useful for plant trait identification in agriculture. However, existing generative models struggle to maintain object-level accuracy when translating images between domains, especially when domain gaps are significant. In this work, we introduce AGILE (Attention-Guided Image and Label Translation for Efficient Cross-Domain Plant Trait Identification), a diffusion-based framework that leverages optimized text embeddings and attention guidance to semantically constrain image translation. AGILE uses pretrained diffusion models and publicly available agricultural datasets to improve the fidelity of translated images while preserving critical object semantics. Our approach optimizes text embeddings to strengthen the correspondence between source and target images and guides attention maps during the denoising process to control object placement. We evaluate AGILE on cross-domain plant datasets and demonstrate its effectiveness in generating semantically accurate translated images. Quantitative experiments show that AGILE improves object detection performance in the target domain while maintaining realism and consistency. Compared to prior image translation methods, AGILE achieves superior semantic alignment, particularly in challenging cases where objects vary significantly or domain gaps are substantial.
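To make the idea of guiding attention maps toward labeled object locations concrete, the following is a minimal sketch of one possible guidance energy, not the objective used by AGILE, which is not specified in this summary. It assumes a single 2D cross-attention map for an object token and a bounding-box label; the functions `box_mask` and `attention_box_energy` and the toy 8x8 map are hypothetical and introduced only for illustration. The energy is near zero when the attention mass lies inside the box and one when none of it does.

```python
import numpy as np

def box_mask(height, width, box):
    """Binary mask for an (x0, y0, x1, y1) box in pixel coordinates (end-exclusive)."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float64)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def attention_box_energy(attn, mask, eps=1e-8):
    """Hypothetical guidance energy: squared fraction of attention mass outside the mask."""
    inside = float((attn * mask).sum())
    total = float(attn.sum()) + eps  # eps guards against an all-zero map
    return (1.0 - inside / total) ** 2

# Toy attention map with all of its mass in the top-left 2x2 corner.
attn = np.zeros((8, 8))
attn[0:2, 0:2] = 0.25

# Box that matches the attended region versus a box far away from it.
aligned = attention_box_energy(attn, box_mask(8, 8, (0, 0, 2, 2)))
misaligned = attention_box_energy(attn, box_mask(8, 8, (4, 4, 8, 8)))
```

In a diffusion pipeline, an energy of this kind would typically be differentiated with respect to the latent at each denoising step so that the update pushes attention toward the labeled region; that step is omitted here because it depends on the specific model and framework.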
