U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

Fenghe Tang, Chengqi Dong, Wenxin Ma, Zikang Xu, Heqin Zhu, Zihang Jiang, Rongsheng Wang, Yuhao Wang, Chenxu Wu, Shaohua Kevin Zhou

TL;DR

U-Bench tackles the lack of fair, large-scale benchmarks for U-Net variants in medical image segmentation by evaluating 100 U-shaped networks across 28 datasets and 10 modalities. It introduces U-Score, a deployment-oriented metric that balances accuracy and efficiency, and pairs IoU with statistical tests to assess significance, while also probing zero-shot generalization. The framework reveals that in-domain IoU gains are often marginal, but zero-shot improvements are more robust, and efficiency-focused modeling is increasingly beneficial; a model-advisor agent guides practitioners toward dataset-aware model choices. By releasing code, weights, and protocols, U-Bench provides a reproducible foundation for fair benchmarking and practical deployment in the next decade of U-Net-based segmentation research.

Abstract

Over the past decade, U-Net has been the dominant architecture in medical image segmentation, leading to the development of thousands of U-shaped variants. Despite this widespread adoption, there is still no comprehensive benchmark that systematically evaluates the performance and utility of these variants, largely because of insufficient statistical validation and limited consideration of efficiency and generalization across diverse datasets. To bridge this gap, we present U-Bench, the first large-scale, statistically rigorous benchmark that evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities. Our contributions are threefold: (1) Comprehensive Evaluation: U-Bench evaluates models along three key dimensions: statistical robustness, zero-shot generalization, and computational efficiency. We introduce a novel metric, U-Score, which jointly captures the performance-efficiency trade-off, offering a deployment-oriented perspective on model progress. (2) Systematic Analysis and Model Selection Guidance: We summarize key findings from the large-scale evaluation and systematically analyze the impact of dataset characteristics and architectural paradigms on model performance. Based on these insights, we propose a model advisor agent to guide researchers in selecting the most suitable models for specific datasets and tasks. (3) Public Availability: We provide all code, models, protocols, and weights, enabling the community to reproduce our results and extend the benchmark with future methods. In summary, U-Bench not only exposes gaps in previous evaluations but also establishes a foundation for fair, reproducible, and practically relevant benchmarking in the next decade of U-Net-based segmentation models. The project can be accessed at: https://fenghetan9.github.io/ubench. Code is available at: https://github.com/FengheTan9/U-Bench.

Paper Structure

This paper contains 34 sections, 14 equations, 13 figures, 22 tables.

Figures (13)

  • Figure 1: Overview of U-Bench. (A) Summary of U-Bench, which encompasses the most comprehensive large-scale evaluation of U-shaped architectures. (B) Word cloud of the 100 published U-shaped variants in the U-Bench Model Zoo. (C) Examples of the 28 datasets in the U-Bench Data Zoo. Red / green boxes mark the in-domain / zero-shot splits used for evaluation. (D) Literature analysis. Among 100 recent works, 84% neglect zero-shot evaluation and 73% lack statistical significance testing. (E) Significance analysis. Only a minority of variants achieve statistically significant gains over U-Net. (F) Overview of a new metric, U-Score. Top: IoU does not account for efficiency, while U-Score demonstrates a strong correlation with both segmentation performance and efficiency metrics. Bottom: while IoU shows a trend of saturation, U-Score highlights the yearly trend toward more efficient models. (G) The evaluation and analysis aspects covered in U-Bench.
  • Figure 2: Summary of U-shaped networks. The network comprises an encoder, a bottleneck, and a decoder with skip connections, each of which can integrate attention gates and multi-scale fusion.
  • Figure 3: Comparison between IoU and U-Score. The red rectangle marks models that outperform U-Net in IoU, and the green rectangle marks models that outperform U-Net in U-Score. (A) Across 100 variants, few methods achieve a better IoU than the baseline U-Net, while more than half achieve a better U-Score. (B) The relationship between performance (IoU) and the increase in computational resources (FLOPs, parameters, FPS) is complex, whereas U-Score offers a clear distribution that effectively distinguishes favorable from unfavorable accuracy-efficiency trade-offs.
  • Figure 4: Performance trends of SOTA models over the past decade. The x-axis indicates publication year, with each point marking the yearly best result. The y-axes report two evaluation metrics: IoU (left axis) and U-Score (right axis). Trend summaries are shown as arrows at the top of each y-axis, with green arrows highlighting improvements and red arrows indicating stagnation. Source-domain performance is shown at the top, and zero-shot performance is shown at the bottom.
  • Figure 5: Statistical significance analysis against U-Net across 28 datasets and 10 modalities. The outer blue pie represents the number of variants surpassing U-Net; the inner pie quantifies the statistical significance of the improved methods, graded from non-significant to highly significant, with the number of works annotated in the middle. In general, in-domain improvements show limited statistical significance, while zero-shot improvements are more often significant.
  • ...and 8 more figures
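
The pairing of IoU with a significance test against U-Net, as described for Figure 5, can be illustrated with a minimal sketch. This listing does not state which statistical test U-Bench uses, so the exact paired sign-flip permutation test below is an assumption chosen because it needs only the standard library; the function names `iou` and `paired_sign_flip_p_value` and the example scores are hypothetical and are not taken from the U-Bench code base.

```python
from itertools import product


def iou(pred, target):
    """Intersection over Union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def paired_sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided paired permutation (sign-flip) test.

    Under the null hypothesis, each per-image difference is equally likely
    to have either sign. All 2**n sign assignments are enumerated, so this
    is only practical for a small number of paired scores.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding.
        if stat >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical per-image IoU scores for a variant and for the U-Net baseline.
variant_iou = [0.82, 0.79, 0.91, 0.75, 0.88, 0.80]
baseline_iou = [0.80, 0.78, 0.90, 0.76, 0.85, 0.79]
print("p-value:", paired_sign_flip_p_value(variant_iou, baseline_iou))
```

A paired test is used because both models are scored on the same images, so per-image differences remove the variation in image difficulty. For realistic test-set sizes, full enumeration would need to be replaced by random sampling of sign assignments or by a standard paired test.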