Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization

Xuefei Wang, Kai A. Horstmann, Ethan Lin, Jonathan Chen, Alexander R. Farhang, Sophia Stiles, Atharva Sehgal, Jonathan Light, David Van Valen, Yisong Yue, Jennifer J. Sun

TL;DR

The paper tackles the persistent last-mile bottleneck in adapting production biomedical imaging tools to bespoke datasets. It demonstrates that a minimal Base Agent framework can consistently surpass expert-tuned baselines across Polaris, Cellpose, and MedSAM pipelines, with substantial reductions in adaptation time. Through a systematic analysis of the agent design space, it shows that more complex architectures do not universally improve performance and that task context matters for design choices. The authors provide a practical, open-source framework and validate real-world impact by deploying agent-generated functions into production, outlining a roadmap for scalable tool adaptation in biomedical imaging.

Abstract

Adapting production-level computer vision tools to bespoke scientific datasets is a critical "last mile" bottleneck. Current solutions are impractical: fine-tuning requires large annotated datasets scientists often lack, while manual code adaptation costs scientists weeks to months of effort. We consider using AI agents to automate this manual coding, and focus on the open question of optimal agent design for this targeted task. We introduce a systematic evaluation framework for agentic code optimization and use it to study three production-level biomedical imaging pipelines. We demonstrate that a simple agent framework consistently generates adaptation code that outperforms human-expert solutions. Our analysis reveals that common, complex agent architectures are not universally beneficial, leading to a practical roadmap for agent design. We open source our framework and validate our approach by deploying agent-generated functions into a production pipeline, demonstrating a clear pathway for real-world impact.

Paper Structure

This paper contains 42 sections, 4 equations, 27 figures, 5 tables.

Figures (27)

  • Figure 1: Overview. (Top) Production-level tools [laubscher2024accurate, stringer2025cellpose3, medsamNatCom] accelerate scientific discovery but face a "last mile" adaptation bottleneck. (Middle) Domain experts spend weeks to months manually coding preprocessing and postprocessing steps in order to adapt the tools to their bespoke datasets. AI agents can automate this adaptation, but it remains unclear how to navigate their complex design space to build simple, practical agents. (Bottom) Our work systematically studies the design choices of tool adaptation agents.
  • Figure 2: Comparison of expert-optimized and agent-optimized MedSAM segmentation results. Left column: visual results, showing (Top) the raw image with the ground-truth mask and prompt (red box), (Middle) the segmentation using expert-optimized functions, and (Bottom) the segmentation using agent-generated functions. Middle and right columns: comparison of expert (Top) and agent-generated (Bottom) code for preprocessing and postprocessing, respectively.
  • Figure 3: Optimal API Space Characterization. We analyzed the top 20 functions from all settings, visualizing API call frequency (node size) and co-occurrence frequency (edge weight). The solution spaces for Polaris and Cellpose are highly concentrated, whereas the MedSAM space is significantly more dispersed, with high co-occurrence ratios distributed across the graph. This observation is quantitatively validated by the dispersion score (edge weight entropy), which is notably higher for MedSAM.
  • Figure 4: Optimal Parameter Space Characterization. We compared the distributions of six commonly used parameters in the Base Agent setting against an "optimal" range derived from the top 20 functions across all settings. The distributions were generally similar, with the notable exception of threshold_abs in peak_local_max, a commonly used postprocessing parameter in Polaris.
  • Figure 5: Analysis of Agent-Generated Solutions. Top: Solution diversity, measured by pairwise Jaccard dissimilarity of API sets (see appendix for the formal definition). Both "Reasoning LLM" and "Add Function Bank" tend to increase diversity whereas "Add Expert Function" limits it. Bottom: Enabling the Function Bank has a different effect on solution length depending on the task. On concentrated solution spaces (Polaris, Cellpose), there is no obvious effect; however, on the dispersed solution space (MedSAM), the solutions tend to get longer.
  • ...and 22 more figures
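
The captions for Figures 3 and 5 name two metrics: a dispersion score described as edge weight entropy, and solution diversity measured by pairwise Jaccard dissimilarity of API sets. The paper's formal definitions are in its appendix and are not reproduced here, so the following is only a minimal sketch under stated assumptions: each solution is reduced to the set of API names it calls, diversity is taken as the mean Jaccard dissimilarity over all solution pairs, and dispersion is taken as the Shannon entropy (in bits) of normalized API co-occurrence counts. The function names and the toy solution sets are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def jaccard_dissimilarity(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of API names."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_dissimilarity(api_sets):
    """Average Jaccard dissimilarity over all unordered pairs of solutions."""
    pairs = list(combinations(api_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_dissimilarity(a, b) for a, b in pairs) / len(pairs)


def edge_weight_entropy(api_sets):
    """Shannon entropy (bits) of normalized API co-occurrence counts.

    Assumption: an edge (u, v) gains one unit of weight each time both
    APIs appear in the same solution. The paper's definition may differ.
    """
    counts = Counter()
    for apis in api_sets:
        for edge in combinations(sorted(set(apis)), 2):
            counts[edge] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical toy data: three solutions, each reduced to its API set.
solutions = [
    {"gaussian", "peak_local_max"},
    {"gaussian", "peak_local_max"},
    {"gaussian", "clahe"},
]
print(mean_pairwise_dissimilarity(solutions))
print(edge_weight_entropy(solutions))
```

Under these assumptions, a concentrated solution space (most solutions sharing the same few API pairs) yields low entropy, while a dispersed one spreads weight across many edges and yields higher entropy, which matches the qualitative contrast the Figure 3 caption draws between Polaris or Cellpose and MedSAM.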